Arcus Pro-Stat Help: |CONTENTS|
¬<Introduction>╪496 ¬
¬<Basics>╪17430 ¬
¬<Data Management>╪35334 ¬
¬<Database Manager>╪59749 ¬
¬<Analysis>╪77148 ¬
¬<Algebraic Calculator>╪288806 ¬
¬<Setup>╪9267 ¬
¬<Technical Information>╪3186 ¬
¬<Appendices>╪292864 ¬
¬<Reference List>╪310584 ¬
¬<Help>╪297385 ¬
This is the hypertext help system for Arcus Pro-Stat version 3. If you are not
sure how to use this system then please press F1 now.
|Introduction|
Arcus is a general statistical analysis package which has been developed for
use in biomedical research. It has also found popularity in education and many
branches of commerce. The Arcus project was started because the aims listed
below were not met by any other software package for the PC. Arcus has now
developed a style of its own and a worldwide reputation for making statistical
analysis more approachable. As we develop the Arcus project the following aims
continue to direct our work.
1. A collection of the most commonly used statistical procedures built on
robust modern methodology to achieve accuracy and to avoid the
compromise of approximation wherever possible.
2. A user friendly approach which is intuitive and which requires little
reference to printed literature.
3. A detailed coverage of the statistical procedures which are done badly
or not at all by other statistical packages.
4. A toolbox of basic statistical procedures which are useful in research
but are seldom found in easily accessible forms in other statistical
packages.
5. A project for which the primary objective is not financial but is a
dedication to the excellence of the product. This project is to be
supported indefinitely.
Since the conception of the Arcus project in 1988 there has been a commitment
to provide facilities which the users request and, most importantly, to present
these facilities in a way which is user friendly. These objectives are often
difficult to apply to statistical analysis but after much consultation with
Arcus users it has been possible to develop interfaces which are intuitively
simple to use. As a registered user you are now entitled to submit suggestions
for the development of the Arcus project. If you are a member of an
organisation which has a site licence for Arcus then please make your
suggestions through one representative. If you have any problems with this
software or suggestions of new features for future versions then you are most
welcome to write to us. Please make clear reference to published literature
in all correspondence concerning statistical calculation or inference.
Newsletters keep the Arcus user informed of developments in the project and
you are invited to submit articles concerning any aspect of statistical
analysis, computing or your application of Arcus.
All correspondence should be sent to:
Dr Iain E. Buchan,
Medical Computing,
83, Turnpike Road,
Aughton,
West Lancashire, L39 3LD.
UNITED KINGDOM
Tel (0)695 424 034
Fax (0)51 256 7001
|Technical Information|
Arcus requires at least 448k of free memory (i.e. 640k + disk based DOS) in a
286 or better system running or emulating MS DOS version 3.30 or later. MS DOS
version 5 and above enhances Arcus Pro-Stat by providing more memory and
executing the code faster than previous versions of DOS. If you have extended
memory configured as expanded memory using a driver such as EMM.EXE or
EMM386.EXE then Arcus Pro-Stat will use this to improve the overall efficiency
of the package. Further enhancements in operation speed are afforded by using
a disk cache system such as SMARTDRV.SYS supplied with MS DOS.
The number of data points which Arcus can hold at any one time is determined
by the amount of memory which your computer has free. This is reflected in the
storage capacity of the
worksheet. When you start an Arcus session the number of cells which the
worksheet can contain is a function of the amount of addressable free memory
divided between 50 columns. You can reset the column limit and then the
maximum number of rows is determined by free memory. The total data storage
capacity is greatest on a well configured 486 or Pentium with expanded memory.
Arcus Pro-Stat will run faster in the presence of a mathematical co-processor
because the burden of floating point maths is taken away from the program code
which emulates a co-processor in the absence of one. Some calculations and
sorting/ranking procedures will run up to five times faster. 486 DX and
Pentium systems have floating point co-processors as standard.
Please note that Arcus now requires at least a 286 processor. It will not run
on old 8086, 8088, V20 or V30 systems.
Microsoft's mouse driver (MOUSE.COM) is supplied on installation disk one; this
should be tried if you experience problems with your existing mouse driver
software.
Arcus graphics screen modes are selected by an internal system analysis routine
(Autoselect) but this may be overridden by an option in the ¬setup╪9267 ¬ menu. Due
to the wide diversity of video cards available Arcus can not be guaranteed to
display every screen perfectly but it has been tested with CGA, EGA, VGA, MCGA
and Hercules. If you have any problems with Arcus graphics then try using
different user defined screen selections.
In order to display Arcus graphics with a Hercules monochrome graphics adapter
you will need to have loaded the MSHERC.COM program before starting the main
Arcus program. Install handles this for you by inserting the line MSHERC.COM
into the ARCUS.BAT file which loads Hercules support routines when a Hercules
monochrome adapter is detected.
The graphics provided in Arcus can be used for presentation if you have a
PostScript printer. The other printer options, Hewlett Packard Laserjet and
Epson FX, are simple screen dumps which are intended for instant visual
analysis only. If you do not have a PostScript compatible printer then you can
save Arcus PostScript graphics files to disk and have them printed out on a
PostScript system at a later date.
Most results screens, including the pictorial statistics selections which are
marked with a hash (#) in the menu, use only standard ASCII characters so that you
can obtain a hard copy using any line printer. This is achieved by pressing P
or E when results are displayed. Please do not use the print screen key. Once
you have pressed P or E you enter the Arcus screen editor; the screen will turn
to inverse video (black on white) and you have an opportunity to annotate the
results before they are sent to the printer or to a log file on disk (please
refer to ¬Basics╪17430 ¬). The printing routines are designed to keep a paper record
of the work done in your Arcus work sessions and they operate most efficiently
with continuous or sheet-fed stationery. For uninterrupted output please be
sure to set the lines per page option in the setup menu; this defines the
number of lines which your printer fits on one page.
If you experience a problem of Arcus Pro-Stat "hanging up" (i.e. no response
from the keyboard) then please make sure that you have avoided the following
situations. Firstly you must not use Arcus Pro-Stat on a computer which runs
anything less than a 286 processor. Secondly you must remove unnecessary TSR
(terminate and stay resident) programs before running Arcus Pro-Stat. Very few
TSRs cause problems but I have come across some rogue public domain and early
freebie system utilities which conflict with code that conforms strictly
to Microsoft standards. Examples of these rogues are KEYBUK.EXE and SPEED.SYS.
Please use the MS DOS KEYB.COM routine in place of KEYBUK.EXE and do not use
SPEED.SYS. Please do not use any non-standard DOS components, especially
replacements for COMMAND.COM. DOS components which can cause strange
looking screens are ANSI.SYS and the MODE.COM PAGE settings. Try removing
these from the CONFIG.SYS file; they are not used by good software and they
take up memory.
If you are a Microsoft Windows user then please note that you can use the
clipboard to paste results screens to other applications if you have installed
Arcus Pro-Stat as a DOS application in Windows running in enhanced mode. Please
remember that Arcus must be started via the ARCUS.BAT batch file, therefore you
must specify ARCUS.BAT as the command line when installing Arcus as a DOS
application in the Windows environment. DO NOT LET WINDOWS INSTALL ARCUS
AS A DOS APPLICATION WITH THE COMMAND LINE ARCUS_.EXE, IT MUST BE ARCUS.BAT
INSTEAD! Arcus Pro-Stat takes advantage of some memory management features in
Windows even though it is run as a DOS application.
Arcus Pro-Stat has been developed using Microsoft FORTRAN version 5.1, Microsoft
BASIC Professional Development System version 7.1 and Microsoft Macro Assembler
with all compiled code linked by Blinker version 3.0. All executable code
conforms to LIM (Lotus Intel Microsoft) standards and will take advantage of
LIM 4 expanded memory if present.
|Setup|
¬<Data File Path>╪10092 ¬
¬<Printer Port>╪10623 ¬
¬<Lines per Page>╪10854 ¬
¬<Graphics Printer>╪11630 ¬
¬<Graphics Screen>╪12540 ¬
¬<Mouse Sensitivity>╪12882 ¬
¬<Screen Colours>╪13127 ¬
Some information about your computer hardware and preferences is kept in memory
for Arcus to refer to. This information is stored in a setup file called
ARCUS.SET which you will find in the Arcus program directory. Do not attempt
to alter this file externally. All setup information is configured via the
setup menu. When you are happy with the information you have specified then
you can update the ARCUS.SET file by selecting "save new settings". If the
ARCUS.SET file is accidentally lost then you are forced through this setup
procedure when you begin an Arcus session.
|Data File Path|
This is the disk location where Arcus worksheet files are to be stored. If you
followed the default installation procedure on hard disk drive C then this
location will be C:\ARCUS\DATA. Using the \DATA sub-directory off the \ARCUS
directory is logical; you are advised to keep your hard disk structure as simple
as possible. There are, however, circumstances such as network use when you
would rather use a removable disk for data storage. If this is the case then
simply enter the drive path A:\.
|Printer Port|
This refers to the parallel printer port that you want to use for Arcus
print-outs. Most computers have at least one of these ports, designated LPT1,
LPT2 etc. You can not select a serial port (COM1 etc).
|Lines per Page|
This tells Arcus how many lines of text your printer fits on one page. It will
vary with font, line spacing and paper size. Choose the lines per page figure
which is appropriate to your printer when first switched on. If you do not set
this information properly then Arcus will put page breaks in the wrong place.
This will cause printing over perforations or odd looking sheet fed print-outs
with large gaps.
If you are a Laserjet user then you can select the number of lines per page on
the printer as well as in Arcus setup. You are advised to select a small font
so that you save paper.
If you are a PostScript user then you can ignore this option; it is set automatically
for you when you select PostScript as the graphics printer type.
|Graphics Printer|
Arcus treats printed graphics in one of two ways. The first is a simple screen
dump for instant visual analysis only and the second is high quality output for
presentation. The only target for presentation quality graphics is PostScript.
PostScript was chosen for Arcus as it is a portable and versatile language.
PostScript output from Arcus can be sent directly to a printer or to an encapsulated
PostScript file (EPS) on disk. You can use this EPS file as a graphic figure
in most word processing documents intended for a PostScript printer.
Simple screen dumps are provided for Hewlett Packard Laserjet and Epson FX
compatible printers. You can select the resolution and orientation of the
output. Please remember that these screen dumps are not intended for
presentation; if you need presentation output then please consider a PostScript
cartridge for your Laserjet.
|Graphics Screen|
Arcus can detect the best setting for most graphics adapters when you have set
this option to "Autoselect". There are, however, exceptions so you are given
the option of forcing Arcus to use a particular graphics mode. You can not use
a video mode if it is not supported by your video card (see hardware manual).
|Mouse Sensitivity|
This sets the amount of mouse movement needed to shift the cursor. Thus a low
setting requires less movement of the mouse to move the cursor, i.e. it is more
sensitive. Settings are 1 to 100; most rodents prefer around 20.
|Screen Colours|
This section enables you to select colours for various categories of text. Some
Arcus screen colours can not be changed. A black background has been chosen
quite deliberately; this is to minimise the ambient radiation. Radiation from
monitors may not prove to be a significant problem but why take the chance?
|Save New Settings|
This saves the settings in the rest of this section to a file called ARCUS.SET.
Unless you save your settings in this file they will not take effect next time
you start Arcus.
|Return to Previous Menu|
This is the "step back" button in the Arcus menu system. It is also achieved
by pressing the Esc key or the right mouse button.
|Windows|
If you are a Microsoft Windows user then you should consider running Arcus as
a DOS application from within Windows. When running Arcus within Windows in
386 enhanced mode you can paste Arcus results screens into the Windows
clipboard for subsequent use in Windows applications. This is done by pressing
Alt + Enter when Arcus is running; at this point you have a window of Arcus
within the Windows environment. You might find the best results with the font
set at 10 x 16. From the pull down menu of this window you select edit and copy
to grab marked text or graphics from the Arcus window. This is then available
in the clipboard for pasting into Windows applications. You can run Arcus
Pro-Stat from a window within Windows but this is not advisable as it slows down
all screen writing processes. You can not initiate Arcus graphics when running
Arcus in a window within Windows; this requires full screen operation. See also
"¬Technical Information╪3186 ¬".
|DOS Shell|
This option provides access to all of your other programs without losing any
of the Arcus information you are working with. The memory overhead is just
4k bytes, so you have enough memory left to run virtually any
application. When you select this option you are presented with the DOS
prompt from which you can issue all of the commands you could before you
started Arcus. To return to the current Arcus session you simply enter
EXIT at the DOS prompt.
¬<Windows>╪13832 ¬
|Developer's Notes|
Running Arcus in a Shell:
Free memory required = at least 384k
Calling convention = ARCUS.BAT (NOT ARCUS_.exe !!!)
Expanded memory = desirable but not essential (used for overlays)
Automatic file loading:
You can export text files from your application and execute Arcus with the
exported information already loaded and the starting position within Arcus
already defined.
Arcus files have the following structure:
Z%,"date of saving","description of contents"
"name of column 1", J1%
"name of column 2", J2%
dc1r1!
dc1r2!
dc1r3!
dc2r1!
dc2r2!
dc2r3!
Key:
Z% = number of variables (columns in worksheet, above it would be 2)
JX% = number of data (rows) in worksheet column X, as an integer
dcxry! = datum for column x, row y, as a single precision real number (thus
the data are read down and columns across the sheet from left to right)
The following is an actual Arcus worksheet file:
3,"29-03-1993","Arcus sample file"
"col 1 ",3
"col 2 ",3
"col 3 ",3
1
2
3
1
2
3
1
2
3
This file has the following structure:
number of variables, date, description of file
name of variable, number of data in variable
repeat for no of variables...
data read down each variable in turn...
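To make this concrete, here is a minimal sketch of a writer for this structure
in Python (illustrative only; the function name write_arcus_file is hypothetical
and the %g formatting only approximates single precision output):
def write_arcus_file(path, columns, date, description):
    # columns is a list of (name, values) pairs, for example
    # [("col 1", [1, 2, 3]), ("col 2", [1, 2, 3])]
    with open(path, "w") as f:
        # Z%,"date of saving","description of contents"
        f.write('%d,"%s","%s"\n' % (len(columns), date, description))
        # one header line per column: "name of column X", JX%
        for name, values in columns:
            f.write('"%s",%d\n' % (name, len(values)))
        # data are written down each column in turn
        for name, values in columns:
            for v in values:
                f.write("%g\n" % v)
# Reproduces the sample file shown above:
write_arcus_file("TEST",
                 [("col 1", [1, 2, 3]),
                  ("col 2", [1, 2, 3]),
                  ("col 3", [1, 2, 3])],
                 "29-03-1993", "Arcus sample file")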
To start an Arcus session with a file called TEST already loaded you would use
the command line ARCUS TEST or ARCUS.BAT TEST. Please note that the opening
credit screen is skipped if you opt for automatic file loading on start up.
The full command line options are: ARCUS /F$ /X% /R% /L$
Key:
F$ = file to load on starting
X% = code for starting locus: 9 = data management menu
                             12 = worksheet
                              1 = analysis menu
                              0 = main menu
R% = current printer row
L$ = log file name
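For example, assuming the options are given positionally in the order shown
above (only the plain ARCUS TEST form is documented here), the hypothetical
command line ARCUS TEST 12 would start Arcus with the file TEST loaded and the
worksheet displayed.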
If you have any questions then please do not hesitate to contact me:
Iain E. Buchan,
Medical Computing,
83, Turnpike Road,
Aughton,
West Lancs L39 3LD.
TEL UK (0)695 424 034
FAX UK (0)51 256 7001
The |Basics|
The Arcus user interface consists of plain text on a dark background. Menu
selections are text icons of keys which can be pressed to select those menu
items. Alternatively the cursor keys or a mouse can be used to move the
highlighted menu selection to the required item which is then selected by
pressing the enter key or the left mouse button. The menu system is a
branching structure. Moving backward to a previous menu is achieved by
pressing the escape key, selecting its icon or by pressing the right mouse
button. The mouse options will only work if a mouse is present, mouse driver
software is active and the mouse sensitivity has been defined in the setup menu.
The menu system is accompanied by a context sensitive hypertext help system.
Help screens are called up by pressing F1 or the middle mouse button (if you
have a three button mouse). Each help screen is relevant to the menu item
which is currently highlighted. A "Statistical Method Selection" section also
provides information within Arcus. This facility will attempt to find the best
test for your data but please remember that it is not a panacea of statistical
methodology (ref 2). If you have any doubt about the best method for your data
you should try to consult a statistician and you should most certainly consult
a reputable text book. This hypertext manual discusses the functionality of
Arcus Pro-Stat but gives only a brief outline of the statistical methods used.
For further statistical information I recommend that you seek out the references
listed as Core Texts in the ¬reference list╪310584 ¬. A list of good introductory texts
is also provided in the reference section.
¬Confidence intervals╪31897 ¬ (CI) are increasingly used in statistical inference.
Particular effort has been made to allow Arcus to address this valuable trend.
Wherever possible the most exact method for the CI has been used. Before
calculation of a CI a screen is displayed to enable you to select a coefficient
of confidence. Short-cut key strokes are given for the commonly used confidence
levels, for example pressing the enter key will set a 95% confidence level for
the calculation which follows. You are also given the opportunity to enter
your own confidence coefficient.
Some of the Arcus functions are time consuming. When a process is taking an
appreciable amount of time you are usually given a warning message. Please do
not assume that the program has "crashed", this is highly unlikely. The most
time consuming functions are the Lotus work file link, the calculation of exact
probability for the Mann-Whitney U statistic in the presence of tied data and
sample sizes for the comparison of means.
Hard copies of results from a printer are obtained by pressing the P key when
results screens are displayed. The ¬setup╪9267 ¬ menu and printer must be carefully
configured.
A flexible print routine, the Arcus screen editor, is invoked by pressing P or
E when results screens are displayed. This allows you to annotate a text screen
then send the results to a printer or to a log file on disk. The screen editor
accepts standard edit key combinations:
Ctrl+N Insert a line
Ctrl+Y Delete a line
Ctrl+P Embed a character
Ctrl+Page Up Move to top of text
Ctrl+Page Down Move to bottom of text
If you save your results to a log file then you have a text file of results
from the current Arcus session on disk. This text file can be examined and
printed subsequently using the log file editor listed in the data management
menu; it can also be imported by word processing software. The name of the log
file is composed of the day, the month and the number of the Arcus session on
that day (one log file per session), with the extension ".LOG" (for example,
1201_3.LOG for the third session on the twelfth of January).
Throughout Arcus the word variable is used to refer to a column of numbers in
the worksheet. These columns represent groups of data which can be investigated
via the analysis section. Any of the analyses which do not require columns of
data from the worksheet are listed under the Instant Functions section. This
section includes distribution functions and methods for contingency table
analysis.
|Essentials|
Arcus aims to provide a user-friendly interface to statistical methods. This
aim presents two major hurdles, the first is the ease of use of the software
itself and the second is the level of assumed knowledge of statistics. The
development of Arcus has focused on providing basic statistical methods in an
intuitively simple package. One could say that statistical software should not
be used by people who do not understand "statistics" and therefore justify a
high level of assumed knowledge in statistical software. We do, however, live
in the real world where people forget statistical principles learned in the past
but need to apply them to their research. If Arcus can facilitate appropriate
statistical design, analysis and inference by combining text and tools then the
Arcus project will have achieved its objectives.
If you are an experienced statistician then you will find useful functions in
Arcus which are absent or awkward to use in other statistical packages. It is
quicker to process data using Arcus on your desktop and only resort to SAS or
Genstat etc if you need a function which is not covered by the present version
of Arcus.
If you are an infrequent user of statistical methods then here is an approach
you might find useful. Consider your research as a sequence of actions:
planning, data collection, data preparation and description, further analysis
and presentation. You are the expert in the questions you are investigating
so you MUST think long and hard about these questions BEFORE you start the
research. Then consider how you can analyse any data you collect. Ask yourself,
will I be able to answer the questions I am asking or does my study leave itself
open to criticisms such as too many confounding variables? In this situation
you might need more control over your experimental conditions if this is
possible. Sample size estimation is a difficult area for the uninitiated;
Arcus provides sample size calculations but I would advise you to seek
statistical advice at this stage. A short time with a statistician at the
planning stage can save a lot of misdirected time and effort in the long run.
BEFORE you see the statistician you must have thought carefully about the
nature, collectability, controllability and appropriateness of the data you
plan to collect. If you go prepared you will get better answers faster.
There is a statistical method selection section within Arcus but it deals with
only the most basic statistical analyses. You are asked a series of questions
about your study and you are given the most appropriate hypothesis test to use
provided you are asking one of the simple questions covered by this section.
Remember that the most simple questions often provide the most powerful answers.
In some ways this function is an over-simplification and you MUST NOT rely upon
it for planning important studies. It is, however, useful for preparing
yourself before you see a statistician. It will get you thinking along the
right lines and thus make it easier for you to communicate your ideas to the
statistician.
Once you have a basic plan of action you can start to prepare your data for
entry into Arcus. You have three main options: 1) make a database, 2) put
data directly into the worksheet, 3) put data directly into non-worksheet
functions. The latter refers to simple situations such as the contingency
tables in the instant functions section of Arcus. More arduous number entry
tasks are made easier by using a keyboard with a number pad. Most Arcus users
will enter their data into the worksheet. This involves preparing columns of
numbers where each column represents a different group. For more information
please see ¬<Arcus Worksheet>╪36264 ¬. Please note that the help text for each analysis
function gives you information on how to prepare your data. Some users might
wish to make a database from which they can select information for export to
the Arcus worksheet. This is often the easiest approach to questionnaires.
For more information please see the ¬<Database Manager>╪59749 ¬.
The next stage is to look at your data. Are there any odd looking results and
if so, why are they odd? Then describe your data using ¬<Descriptive Statistics>╪80612 ¬.
If you are happy with the questions you were asking before you started the study
then go on to apply the hypothesis test which you planned at the outset. It
may be that there is no appropriate "test"; you should establish your analytical
plan at the start of the study, taking statistical advice if necessary. NEVER
sift through various tests trying to get p<0.05; this is not difficult to detect
and makes you look very unprofessional. If you do not understand why this is
so then please see ¬<p values>╪29175 ¬. The inferences you make from your statistical
analyses require knowledge of both the statistical principles used and the
biological relevance of the numerical conclusions.
The last step is presentation. You might have a well conducted and well analysed
study which falls down on presentation. Here are a few basic pointers: Present
raw data where possible, use graphs if they can show something important, do not
duplicate data (e.g. tables and text), do not present parametric and
non-parametric descriptive statistics together, use the asterisk rating system for
¬<p values>╪29175 ¬ and use ¬<confidence intervals>╪31897 ¬ in discussion.
Summary: Think long and hard about questions
± Try Arcus statistical method selection
± Try Arcus sample size calculations
± Consult a statistician
See the help text for the chosen Arcus functions
Analyse and save results to a log file and/or paper output
± Transfer data to a graphics package
Prepare report quoting Arcus version number and references
|Interacting with Arcus|
Arcus uses a plain text screen with a title bar at the top. The menus are lists
of keys which you can press to select a menu title if you do not have an easier
way of selecting menu items. This occurs with some portable computers where the
cursor keys are awkwardly placed. If you have a good keyboard then select menu
items using the cursor keys and the enter key. The escape key moves you back
a menu. If you have a mouse then move the highlighted menu selection using the
mouse and accept your selection by pressing the left mouse button. The right
mouse button moves you back a menu.
Within an Arcus menu you can access special functions using keys which are not
displayed on the screen:
F1 or Alt+H calls up help text that is relevant to the currently highlighted
menu title.
Alt+P or Alt+E in the help system or results screens invokes the Arcus screen
editor which can be used to annotate text screens then print them or save them
to the active log file.
Alt+N calls up the Arcus notepad on which you can jot down ideas and save them
to the active log file or to the printer.
If you are having problems with your mouse then please make sure that you are
using a standard mouse driver such as Microsoft's MOUSE.COM or MOUSE.SYS. You
do not have to use the mouse driver software which came with your mouse.
Microsoft's MOUSE.COM is supplied on Arcus installation disk one.
|P Values|
The p value or critical level is the smallest significance level at which the
null hypothesis (Ho) would be rejected; equivalently, it is the probability,
assuming Ho is true, of obtaining a result at least as extreme as the one
observed.
The null hypothesis is most often the hypothesis of "no difference" e.g. no
difference between mean blood pressure in group A and group B. This should have
been considered before the start of your study. If you expect results to be in
one direction only then you have a one tailed test. More often you can not be
certain that the results can go in one direction only; you must therefore use
a two tailed p value.
If your p value is less than the chosen significance level then you reject the
null hypothesis i.e. accept that your sample gives reasonable evidence of a
population difference for the parameters you have observed. It does NOT imply
a "meaningful" or "important" difference, that is for you to decide when
considering the biological relevance of your result.
The choice of significance level at which you reject the Ho is arbitrary.
Traditionally the 5%, 1% and 0.1% (p < 0.05, 0.01 and 0.001) regions have been
used. These numbers tend to give a false sense of security when in reality
there are many factors which contribute to the arbitrary nature of these levels.
In the ideal world we would be able to define a "perfectly" random sample, the
most appropriate test and one definitive conclusion. We simply can not. What
we can do is try to optimise all stages of our research to minimise sources of
uncertainty. When presenting p values it is good practice to use the asterisk
rating system:
p < 0.05 *
p < 0.01 **
p < 0.001 ***
Some authors quote statistically significant as p < 0.05 and statistically
highly significant as p < 0.001. The asterisk system conveys more information
and avoids the woolly term "significant".
At this point, a word about error. Type I error is the false rejection of the
null hypothesis and type II error is the false acceptance of the null
hypothesis. As an aide-memoire: think that our cynical society rejects before
it accepts.
The significance level (α) is the probability of type I error. The power of a
test is one minus the probability of type II error. Power should be maximised
when selecting statistical methods. If you want to estimate sample sizes then
you must understand all of the terms I have mentioned here.
You might be interested in further details of probability and sampling theory
at this point. There are a number of good ¬introductory texts╪310834 ¬.
You must understand ¬confidence intervals╪31897 ¬ if you intend to quote p values. You
are encouraged to quote confidence intervals by all good journals.
|Confidence Intervals|
A confidence interval (CI) for a population parameter is the interval in which
the unknown true population value for this parameter is assumed, with a certain
probability, to lie. This probability is arbitrary, 95% (0.95) is the most
commonly chosen value.
The parameter in question can be a mean, difference between two means, a
proportion etc. The CI included with each Arcus function is discussed in the
help text for that function. The interval is often symmetrical about the
parameter but this is not necessarily so. In some studies wider or narrower
confidence intervals will be required. This rather depends upon the nature
of your study. I would advise you to consult a statistician if you plan to
use "non-standard" CI's.
A word about terminology: You will hear the terms confidence interval and
confidence limit used. The confidence interval is the range Q-X to Q+Y where
Q is our parameter and Q-X is the lower confidence limit and Q+Y is the upper
confidence limit.
|Julian Numbers|
The Julian period began on January 1st 4713 BC (a date in the old Julian
calendar). The Julian number of a date represents the number of days since the
start of the Julian period. These numbers are a useful way of representing
dates because the arithmetic difference between two Julian numbers is the exact
number of days between the two dates they represent. The Gregorian calendar
which we use provides no year zero between 1 BC and 1 AD; projected back onto
it, Julian number 1 corresponds to the 25th of November 4714 BC. Please note
that you can use BC dates in the worksheet when it is in date mode but you can
not use BC dates in the Arcus database manager.
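As a sketch of the underlying arithmetic, the standard Gregorian date to
Julian number conversion can be written in Python as follows (illustrative
only; this is not necessarily the exact routine Arcus uses internally):
def julian_number(day, month, year):
    # Standard civil (Gregorian) calendar date to Julian day number.
    # For BC dates use astronomical years (1 BC = 0, 2 BC = -1).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)
# The difference between two Julian numbers is the exact number of days
# between the two dates, e.g. from 25/12/1993 to 1/1/2000:
print(julian_number(1, 1, 2000) - julian_number(25, 12, 1993))  # 2198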
|Statistical Method Selection|
This section provides a simple decision tree for selecting statistical methods
appropriate to your data. Please note that the advice given is only a rough
guide to methods appropriate to your investigation. Only the simpler
experimental designs are covered. If you require a fuller appreciation of the
statistical methods that are appropriate to your investigation then you are
strongly advised to consult a reputable text or a statistician. A common fault
is to read an article which is related to your work and repeat the methods that
have been used by the authors; do not assume that all journals weed out bad
statistical methods!
¬<Measurement Scales>╪34375 ¬
¬<Essentials>╪21747 ¬
¬<Analysis>╪77148 ¬
¬<Reference List>╪310584 ¬
|Measurement Scales|
Before you plan the statistical approach to your investigation you must
understand the nature of the variables you are studying. Different variables
have different mathematical characteristics which usually require different
types of analysis. Please familiarise yourself with the following measurement
scales:
INTERVAL
■ Scale with a fixed and defined interval eg temperature or time.
ORDINAL
■ Scale for ordering subjects from low to high with any ties attributed
to lack of measurement sensitivity eg. pain score from a questionnaire.
NOMINAL with order
■ Scale for grouping into categories with order eg. mild, moderate
or severe. This can be difficult to separate from ordinal.
NOMINAL without order
■ Scale for grouping into unique categories eg. blood group.
DICHOTOMOUS
■ As for NOMINAL without order but with two categories only eg. surgery / no surgery.
¬<Essentials>╪21747 ¬
¬<Reference List>╪310584 ¬
|Data Management|
¬<Arcus Worksheet>╪36264 ¬
¬<Worksheet files>╪44501 ¬
¬<ASCII & Lotus link>╪51843 ¬
¬<Log file editor>╪57897 ¬
Most Arcus analyses require data which have been prepared in rows and columns.
This section provides you with a worksheet with which to edit these data and
other functions which import / export data to / from the worksheet. There is
also a complete database management system which can be used to edit data in
"forms", this is often the easiest approach when processing questionnaire data.
If you need to process small numbers of data, such as contingency tables, then
you do not need to enter these data via the worksheet. All such functions,
which are listed in the analysis menu under "Instant Functions", ask you for
the data they require after you have selected the function. These data are
entered directly in response to instructions on the screen.
|Arcus Worksheet|
The Arcus worksheet can be thought of as a computerised sheet of paper which
holds numbers in rows and columns. This is, however, a rather advanced sheet
of paper with many editing functions and the ability to interpret formulae as
you enter them.
Superficially this worksheet resembles many of the well known spreadsheets
but there are some important differences. Unlike spreadsheets the Arcus
worksheet has been optimised for the preparation of data for statistical
analysis. It does not hold any character data apart from the column labels.
You may enter formulae in a cell (an individual element of a column) but these
formulae are immediately translated into their numeric results. If you want to
transform all of the data in a column by applying a formula to them then simply
press Alt+F. Likewise if you need to create a new column of data as a function
of one or more other columns then you can do so by pressing Alt+Q.
The cursor control keys have the following actions in the worksheet:
arrow right - go one cell to the right
arrow left - go one cell to the left
arrow up - go one cell up
arrow down - go one cell down
home - go to top of current column
ctrl + home - go to top of the first column
end - go to the last entry in the current column
ctrl + end - go to the top of the last column which contains data
Alt + G - go to the column name of your choice
Unlike most spreadsheets the Arcus Worksheet uses the mouse as a pure cursor
locator. There are no scroll bars to aim at; you simply move the cursor using
the mouse and the sheet will shift across if you move past the limit of the
screen. When you try to move the cursor beyond the limit of the sheet itself
you will see a red "LIMIT" sign flash at the cursor location. If you try to
move past the right hand limit of the worksheet then you will be asked whether
or not you wish to extend the worksheet by another column. If there is not
enough memory available to extend the worksheet in this way then the operation
is aborted with a beep. If you start a new sheet knowing that you require more
than the standard 50 columns then you can extend the worksheet to a specified
number of columns using the ¬Reset Parameters╪50354 ¬ selection of the data management
menu. The maximum number of columns per worksheet is 1,000 and the row limit
is 25,000. Please note that resetting the column limit to a small number
increases the maximum size of each column.
Numbers are entered in the worksheet by pressing any combination of
alphanumeric keys followed by the enter key. You can enter numbers or formulae
at the cell editing line. For example 8/SQR(16) would put the solution 2 into
that cell. These formulae are for instant interpretation only, you can not
embed them in a cell of the worksheet and you can not use other cell locators
(e.g. A1 for column 1 row 1) as used by most spreadsheets. The functions which
the cell editor can interpret are listed below and this information is available
in a help screen which is invoked by pressing the F1 key when you are editing
a cell.
Constants: PI
EE as e
ABS absolute value
CLOG common (base 10) logarithm
CEXP anti log (base 10)
EXP anti log (base e)
LOG natural (base e, Naperian) logarithm
SQR square root
! factorial (max 34)
LN! log factorial
IZ normal deviate for a p value
UZ upper tail p for a normal deviate
LZ lower tail p for a normal deviate
^ exponentiation (to the power of)
+ addition
- subtraction
* multiplication
/ division
\ integer division
ARCCOS arc cosine
ARCCOSH arc hyperbolic cosine
ARCCOT arc cotangent
ARCCOTH arc hyperbolic cotangent
ARCCSC arc cosecant
ARCCSCH arc hyperbolic cosecant
ARCTANH arc hyperbolic tangent
ARCSEC arc secant
ARCSECH arc hyperbolic secant
ARCSIN arc sine
ARCSINH arc hyperbolic sine
ATN arc tangent
COS cosine
COT cotangent
COTH hyperbolic cotangent
CSC cosecant
CSCH hyperbolic cosecant
SINH hyperbolic sine
SECH hyperbolic secant
SEC secant
TAN tangent
TANH hyperbolic tangent
AND logical AND
NOT logical NOT
OR logical OR
< less than
= equal to
> greater than
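For example, assuming the usual operator precedence, entering any of the
following at the cell editing line stores the value shown:
8/SQR(16)   gives 2
CLOG(1000)  gives 3
5!          gives 120
2^10        gives 1024
7\2         gives 3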
If you enter a cell in a column which has empty cells above the current
location then the gaps above are automatically filled with missing data values.
The worksheet editing mode is indicated by a "Norm" or "Date" sign at the top
left hand corner. The date editing mode allows you to enter conventional dates
in the European day/month/year format. These entries are stored as Julian
integers in the worksheet but the highlighted cursor location always shows the
conventional date interpretation of the Julian number. Please note that the
difference between two Julian numbers is the exact number of days between the
two dates from which these numbers are derived.
The saving and loading of worksheet data to/from disk takes place outside the
worksheet itself. You will see the relevant functions listed in the data
management menu under ¬Worksheet Files╪44501 ¬.
Labelling of columns is achieved using the key combination Alt+N or Alt+L.
Other special keys which are active in the worksheet are:
F1 help screen
Alt+I insert a cell at the current cursor location
Alt+C insert a column at the current cursor location
Alt+D delete the cell at the current cursor location
Del delete the cell at the current cursor location
Alt+X delete the current column
Alt+Z delete the current row
Alt+N enter or edit a column name
{When you are editing column names you can press TAB / Shift+TAB
to move directly to the next / previous column name.}
Alt+B copy a block from the current column to another column
Alt+T toggle between normal and date editing mode
Alt+P print all rows of selected columns
Alt+S display current column statistics
Alt+G go to a selected column
Alt+F apply a formula to the current column
Alt+R put ¬random numbers╪254271 ¬ into the current column
Alt+Q make a new column as a function of other columns
Space bar enter a missing data value (3.456789E+33 displayed as *)
As in most of Arcus the mouse buttons emulate the enter and escape keys. Thus
the right mouse button (Esc) exits the worksheet and the left mouse button
(Enter) accepts any data you have typed at the current cell then moves down a
cell. Some spreadsheets move the cursor to the right when you press enter but
Arcus moves down. This is quite deliberate as most people prefer to enter
numeric data in columns not rows.
A word about indicator variables. Arcus uses indicator variables for survival
analysis. All other functions require you to provide data from different
groups in different columns. Some stats packages such as SAS use a column of
1's, 2's etc to indicate which group the entry in that row of the data column
belongs to. This is the indicator variable system which Arcus uses for
survival analysis. All other functions ask you for a separate column of data
for each group.
Arcus uses 3.456789E+33 as its missing data value and in all instances this is
displayed as an asterisk (*). This is an internal constant which you do not
need to remember; a cell within the worksheet is marked as a missing
observation by pressing the space bar. In subsequent calculations these values
are skipped and all values in a row containing a missing data value are skipped
if the variables are grouped, e.g. matched pairs.
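A sketch of this rule in Python (illustrative only, not Arcus code):
MISSING = 3.456789e33
x = [1.2, MISSING, 3.1]
y = [2.0, 2.5, MISSING]
# ungrouped: missing values are simply skipped within a column
x_used = [v for v in x if v != MISSING]            # [1.2, 3.1]
# grouped (e.g. matched pairs): the whole row is skipped if any value
# in that row is missing
pairs = [(a, b) for a, b in zip(x, y)
         if a != MISSING and b != MISSING]         # [(1.2, 2.0)]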
|Worksheet Files|
This section enables you to retrieve worksheet data which have been stored on
disk using the ¬Save Worksheet╪47791 ¬ function of this menu.
The standard location for Arcus worksheet files is a sub-directory called \DATA\
off your Arcus directory. If the standard setup has been used for an Arcus
installation on drive C then the full data file path is C:\ARCUS\DATA. Arcus
worksheet files do not use any special extension (the letters after the point
in the file name). You can use any naming system you want. These worksheet
files also have a very simple structure, they are stored in ASCII text. This
simple structure has the benefit of enabling other developers to read and write
Arcus worksheet files easily. This allows other applications, such as custom
databases, to select data for export then write them into an Arcus file.
If you are a developer then please see ¬Developer's Notes╪15349 ¬ for more information
about the Arcus worksheet file structure.
You can load more than one worksheet file from disk into the current worksheet.
This enables very large worksheets to be created from a number of smaller ones.
The process is ultimately limited by the column limit of 1,000 or the amount of
memory your computer has free.
If you want to change the standard data file location then please see ¬Setup╪9267 ¬.
|Arcus File Finder|
Arcus uses the following protocol to search through disks for files. You are
shown a list of titles which you can select using the cursor keys and enter key
or by using the mouse. Disk drives, directories, sub-directories and files are
displayed differently:
[-A-] <----this moves you to drive A
[-B-]
[\ARCUS] <----this moves you to directory \ARCUS
[\DOS]
IO.SYS
AUTOEXEC.BAT
CONFIG.SYS
If we select [\ARCUS] then [\DATA] you might see:
[..] <----this moves you back to the directory \ARCUS
MYDATA
RAT1
SURVEY2 <----this selects the file SURVEY2
Please note that you can jump to files beginning with a certain letter by
pressing that letter on the keyboard when the file list is displayed.
|Arcus Data File Path|
This function enables you to select a worksheet file which has been stored in
the standard Arcus data file location. If you installed Arcus on drive C
using the default paths then this location will be C:\ARCUS\DATA.
The files are presented to you in alphabetical order. If there are many files
to sift through then press the first letter of the file name you are looking
for. This causes the selection bar to jump to files beginning with that letter.
The mouse can also be used to select files. The left hand mouse button or the
enter key makes the selection. The Esc key or the right hand mouse button
quits the file selector without loading a file.
¬<Arcus File Finder>╪45889 ¬
¬<Data File Path>╪10092 ¬
|Select Path|
This function enables you to bypass the standard Arcus data file location and
specify your own path to a worksheet file. This situation might arise when you
have a particular file on floppy disk. To examine the contents of a file in
drive A just enter the path A:\.
¬<Arcus File Finder>╪45889 ¬
¬<Data File Path>╪10092 ¬
|Save Worksheet|
This function enables you to save all of the data in the worksheet to a file on
disk. You are asked to specify the name of this file. No special extensions
are added to this name and you can use your own extension if you wish. Try to
adopt a simple naming system which you can recognise easily. Please note that
you are presented with file names listed in alphabetical order when you come to
recall worksheet files from disk.
The location for storage of worksheet files is also under your control. Arcus
prompts you with the standard data storage path defined when you installed Arcus
e.g. C:\ARCUS\DATA. If this is acceptable then just press the enter key. If
you wish to divert this file, say to a floppy disk, then type in the relevant
path e.g. A:\ for the A drive. If you want to change the standard data storage
path then you can do so via ¬setup╪9267 ¬.
Arcus saves its worksheet files using a very simple text file structure. This
allows software developers to read and write Arcus data files easily. If you
are a developer then please refer to ¬developer's notes╪15349 ¬.
A full description of each worksheet, up to 150 characters, can be added to
each file. You are prompted for this; just press enter if it is not required.
If you change your worksheet and forget to save it then you will be prompted to
do so on finishing the current Arcus session.
|Save Rotated Worksheet|
This is a special function for those who wish to rotate an Arcus worksheet.
Example:
1.. 2.. 3..
1 1.1 0.7 1
2 1.5 0.6 2
3 1.6 0.6 3
4 1.8 0.5 4
...this would become:
1.. 2.. 3.. 4..
1 1.1 1.5 1.6 1.8
2 0.7 0.6 0.6 0.5
3 1 2 3 4
... in other words rows become columns and columns become rows.
The file extension ".ROT" is appended to your file name. Column names are lost.
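The rotation itself is a simple transposition; a sketch in Python
(illustrative only, not Arcus code):
rows = [[1.1, 0.7, 1],
        [1.5, 0.6, 2],
        [1.6, 0.6, 3],
        [1.8, 0.5, 4]]
rotated = [list(column) for column in zip(*rows)]
# rotated == [[1.1, 1.5, 1.6, 1.8],
#             [0.7, 0.6, 0.6, 0.5],
#             [1, 2, 3, 4]]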
|Current Status|
This function simply displays information concerning the current worksheet and
the free memory state of your computer. The latter represents the number of
kilobytes of memory which Arcus can use for data storage and processing.
The time and date displays depend upon you having set these parameters properly.
To change your computer's time or date, just shell out to DOS and enter them
using TIME and DATE commands. Note that times are entered as 14:30:00 and dates
are entered as 09-12-93. If your computer is not maintaining times and dates
then its backup battery is probably flat.
|Reset Parameters|
This section provides you with the ability to wipe clean the current worksheet.
For this reason you must be careful with these functions!
"New Worksheet (50 columns)"
This first selection simply wipes the worksheet leaving an empty 50 column
sheet.
"New Worksheet (user defined columns)"
This second selection wipes the current worksheet and you select the column
limit for the new worksheet. There are two main reasons for setting the
column limit. The first is when you know that you will need more than 50
columns and you do not want to be prompted to extend the sheet each time you
try to pass the column limit. Secondly you might need a very long column
length on a computer with limited memory. To maximise column length in this
situation you must select a small column limit. The absolute maxima are 1,000
columns and 25,000 rows.
"Reset Printer"
This selection enables you to reset the printer line counter. If you have a
Laserjet or PostScript printer then this automatically resets the printer
as well as the line counter within Arcus. If you have any other line printer
then you will need to align new paper to the top row before you continue. The
function basically tells Arcus that you are starting again at the top line.
The next page break will happen when the page length is exceeded. If you need
more information about setting up your printer for Arcus then please refer to
the ¬setup╪9267 ¬ section.
|ASCII & Lotus link|
¬<Plain ASCII file import>╪54064 ¬
¬<Formatted ASCII file import>╪55002 ¬
¬<Lotus compatible ASCII file export>╪55544 ¬
¬<Lotus compatible WK? file import>╪52500 ¬
This section deals with the transfer of data between Arcus and other
applications. Specifically, the import of data from ASCII text files and Lotus
compatible spreadsheets and the export of data to spreadsheets. Please note
that data can also be imported from database files using the ¬Database Manager╪59749 ¬.
If you are a developer wishing to read and write Arcus worksheet files then
please see ¬Developer's Notes╪15349 ¬.
|Lotus Compatible WK? File Import|
Arcus can read binary spreadsheet files which are compatible with Lotus 123
WKS or WK1 files. Applications such as Quattro, Excel and Symphony can export
these files providing you specify the correct file format. Borland's Quattro
automatically produces Lotus compatible files when you save a worksheet with
the .WKS or .WK1 file extension (do not use .WKQ).
One proviso is that you must use column labels in your original spreadsheet.
Arcus uses column labels to identify where columns begin. Once the spreadsheet
file has been read by Arcus you are given a list of column labels which have
been found. You then simply select the columns you wish to bring across as
Arcus worksheet columns. The label each column had in the spreadsheet is
maintained in the Arcus worksheet. Things can get a bit slow with large
spreadsheets, therefore it is better to have the spreadsheet (WK?) file on
hard disk, not on floppy disk.
Gaps within a spreadsheet column are interpreted as missing data. Gaps at the
end of a spreadsheet column are not interpreted unless you enter a missing data
value (3.456789E33) at the end of the column. The column label must be no more
than one gap away from the top of the column of numeric data. If you need a
larger gap at the top of a column then you must enter the Arcus missing data
value (3.456789E33) at this position in the spreadsheet.
Please note that all columns are transferred individually and are appended to
the current worksheet if you have data there already.
|Plain ASCII File Import|
Plain ASCII file describes a simple text file which does not use any special
characters or codes for formatting. Such a file might be produced by a database
report generator or a simple text processor. This Arcus function enables you
to pick out columns of numbers from such a file and load them into the current
worksheet.
Please use only plain text in these files, tabs and other formatting characters
make it difficult to define columns.
You pick out columns of numbers by selecting start, width and end points on the
screen. For this purpose Arcus displays the text file on screen. If your file
is greater than 80 columns then you are asked to define which horizontal
section of the file you wish to search. Gaps or non-numeric text are treated
as missing data.
Importing data in this way can be quite irksome; therefore you should
consider other methods for frequent imports.
|Formatted ASCII File Import|
Some applications output data in text files which use spaces or commas to
delimit data. One such application is FigP.
Consider the file:
1.2,1.3,8
1.5,1,8
1.7,1.0,9
1.7,1.5,10
..this would import into an Arcus worksheet as:
1... 2... 3...
1 1.2 1.3 8
2 1.5 1 8
3 1.7 1 9
4 1.7 1.5 10
NB Do NOT use spaces AND commas to separate your data, use EITHER commas OR
spaces! Do NOT use column titles in the text file.
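A minimal sketch of this kind of import in Python (illustrative only, not the
routine Arcus uses; the treatment of non-numeric fields as missing data
follows the worksheet convention described earlier):
def read_delimited(path, missing=3.456789e33):
    # Split each line on commas or runs of spaces; any field that does
    # not parse as a number becomes the Arcus missing data value.
    rows = []
    with open(path) as f:
        for line in f:
            row = []
            for field in line.replace(",", " ").split():
                try:
                    row.append(float(field))
                except ValueError:
                    row.append(missing)
            rows.append(row)
    return rows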
|Lotus Compatible ASCII File Export|
All good spreadsheets can read comma and quote delimited text files. Column
titles are contained within quotes and numeric data are separated by commas.
Consider the text file:
"Age","Urea","Creatinine"
65,6.5,101
23,3.4,65
44,4,80
..this would export to a spreadsheet as:
Age Urea Creatinine
1 65 6.5 101
2 23 3.4 65
3 44 4 80
Arcus does not export WK1, WKS, WKQ or any other binary spreadsheet files
because there is no point when all good spreadsheets can read these simple
portable comma and quote delimited text files.
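Producing such a file is straightforward; a sketch in Python (illustrative
only; the file name KIDNEY.TXT is hypothetical):
def export_lotus_ascii(path, titles, rows):
    # Column titles within quotes, numeric data separated by commas.
    with open(path, "w") as f:
        f.write(",".join('"%s"' % t for t in titles) + "\n")
        for row in rows:
            f.write(",".join("%g" % v for v in row) + "\n")
export_lotus_ascii("KIDNEY.TXT", ["Age", "Urea", "Creatinine"],
                   [[65, 6.5, 101], [23, 3.4, 65], [44, 4, 80]])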
|Select Data|
This function enables you to select data from a worksheet column which meet
certain criteria that you define. It also enables you to pick out selected
data and change them. There are two basic uses of this function which we
shall look at by example:
1. Aim:
To select all patients over 65 and their serum creatinines.
Source:
A column of ages and a column of creatinines from a group of 100 patients.
Action:
a. Select from column AGE.
Match from column CREATININE.
Expression is >65.
Choose "create new variable".
b. Select from column AGE.
Match from column AGE.
Expression is >65.
Choose "create new variable".
Result:
Two new columns have been appended to the worksheet, one with ages over 65
and another with creatinine values for all the over 65's which match the
ages in the other new column.
2. Aim:
To replace certain values in a worksheet column. You might need this if
you have imported data from an application which uses a different missing
data value to Arcus.
Source:
Any column with unwanted data.
Action:
Select from this column.
Choose "replace values".
Specify the value to replace (eg -999).
Specify the value to replace it with (eg 3.456789E33 the Arcus missing data
value).
Result:
All -999's become 3.456789E33 (* in the Arcus worksheet ie missing data).
Please note that the Arcus Database Manager can also be used to select out data
before you import it to the Arcus worksheet. For more information on this
please see ¬Record Selection╪69953 ¬.
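The select and match behaviour of example 1 can be pictured as follows (a
sketch in Python with hypothetical values, not Arcus code):
age        = [70, 62, 81, 59, 66]
creatinine = [101, 80, 150, 65, 110]
# Select from column AGE, match from column CREATININE, expression >65:
new_creatinine = [c for a, c in zip(age, creatinine) if a > 65]  # [101, 150, 110]
# Select from column AGE, match from column AGE, expression >65:
new_age = [a for a in age if a > 65]                             # [70, 81, 66]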
|Log File Editor|
If you use the Arcus screen editor (invoked by pressing P or E) and choose the
"save to log file" option (F2) then you will have a log file for that Arcus
session saved in the Arcus data sub-directory. Each new Arcus session uses a
separate log file name, this is composed of the day, the month and the number
of the Arcus session on that day, i.e. 1201_3.LOG would be the log file from
the third Arcus session in which a log file was used on the twelfth of January.
This function provides a simple text editor with which you can examine and edit
the content of any text file. It also enables you to send this text to a
printer. If you require more powerful editing functions then please use your
familiar word processor. Note that you can run your word processor within Arcus
by shelling out to DOS; there is no need to finish your current Arcus session.
The cursor location in the Arcus log file editor can be controlled using the
cursor keys or the mouse and the left mouse button. The right mouse button and
the Esc key quit the editor. The editor accepts standard key combinations:
Ctrl+N Insert a line
Ctrl+Y Delete a line
Ctrl+P Embed a character
Ctrl+Page Up Move to top of text
Ctrl+Page Down Move to bottom of text
If you want to enter a character which is not represented on your keyboard then
you can do so by holding down the left Alt key whilst tapping out the ASCII code
of that character on the right hand number pad (if present). For example, the
code Alt + 224 gives the letter alpha. A list of these decimal codes is given
under ¬<ASCII Codes>╪294887 ¬.
If you intend to import Arcus log files into word processing software, try to
specify small font sizes so that you avoid unwanted wrapping of lines.
Arcus |Database Manager|
This provides a facility for creating and maintaining databases which are file
compatible with dBase III plus, dBase IV, dBXL/Quicksilver, FoxPro, FoxBase or
Clipper. It also enables you to import database fields as Arcus variables.
Help prompts are provided in addition to the hypertext help which is invoked
by pressing the F1 key. Help menus are also available via the F1 key within
most functions. If you have even a vague idea of how databases work then you
will find this part of Arcus Pro-Stat intuitively simple. If you are not
familiar with database management systems then you may wish to read
"¬Data-Basics╪74457 ¬".
One notation convention which you should be aware of is the caret sign ^
followed by a key; this indicates that a combination of the Ctrl key plus that
key should be pressed (i.e. ^Home is Ctrl + Home).
For information about supported file structures, limits, record selection and
other technical data please refer to the ¬Database Technical Information╪66897 ¬
section.
If you need to maintain complex multi-relational databases with elaborate
reporting systems then you should select one of the dedicated database
management systems and use the Arcus Database Manager as a link between this
and the Arcus Worksheet. Please make sure that your database manager can
export files which are readable by Arcus. Most database managers can export
files in different formats. The database file formats which Arcus can read are
dBASE III, dBASE IV, FoxPro, FoxBase, dBXL/Quicksilver and Clipper.
|Open Database File|
The first step in using this database manager is to open, or to create and
then open, a database file. Arcus searches for files with the DBF extension
and displays
summary information about each compatible database file in the chosen sub-
directory. The database file types which can be handled by this database
manager are dBASE III, dBASE IV, FoxPro, FoxBase, dBXL/Quicksilver and Clipper.
¬<Create New Database File>╪64240 ¬
¬<Arcus File Finder>╪45889 ¬
|Open Index File|
This function allows you to open an index file which has been made for the
database file which is open. Arcus searches for index files with the NDX
extension and displays summary information about each compatible index file in
the chosen sub-directory. The index file types which can be handled by this
database manager are dBASE III, dBASE IV, FoxPro, FoxBase, dBXL/Quicksilver
and Clipper.
¬<Index or Re-Index Database File>╪65035 ¬
¬<Arcus File Finder>╪45889 ¬
|Copy Data to Another File|
This function enables you to make new database files or new Arcus Worksheet
files from records in the active database file. Both of these links allow you
to be selective in the choice of records and the fields from each of these
records which you copy to the new file. Please note that field names will be
transferred to Arcus data files as worksheet column (variable) labels.
|Browse and Edit Database|
The browse & edit option presents your database in a worksheet format with
fields as columns and records as rows. You can use this option to inspect
and edit the existing records of the active database file. Please note that
an index file for the active database will be updated when you edit records
provided it has been opened; any other inactive index files on disk will not
be updated. If you have several widely spaced fields to edit then you should
use the rearrange fields option to collect these fields onto one screen before
editing. If you wish to replace or remove records then please use the delete
marker in the browse & edit function followed by the remove deleted records
function within the pack/purge records option. If you wish to add new records
then please use the append records option.
|Append New Records|
This function enables you to load data into the active database file. The enter
key is used to confirm the input for a particular field but you must use F3 to
accept the entire record and move on to the next. Familiarity with the function
keys will facilitate easy use of this function.
Please note that the date entry format is DD/MM/YY(YY) but a date is stored as
YYYYMMDD in the database file. It is the YYYYMMDD format which is displayed in
the "browse & edit function". If you use the "copy data to another file"
function then all dates will be translated into Julian numbers.
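As a sketch of these conversions in Python (the YYYYMMDD storage format is as
documented above; the standard Julian Day Number is shown as an assumption,
since the exact Julian numbering used by Arcus is not specified here):
  from datetime import date

  def to_storage(d: date) -> str:
      # Dates are entered as DD/MM/YY(YY) but stored as YYYYMMDD.
      return f"{d.year:04d}{d.month:02d}{d.day:02d}"

  def julian_day(d: date) -> int:
      # Standard Julian Day Number; Arcus's own Julian numbering may
      # use a different epoch.
      return d.toordinal() + 1721425

  print(to_storage(date(1993, 12, 25)))   # 19931225
  print(julian_day(date(1993, 12, 25)))   # 2449347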
|Create New Database File|
The first stage in making a new database file is to create a template using this
function. This template defines how your data will be stored in the database
file on disk. If you specify very large fields then the database manager will
allocate more disk space per record. This can lead to a great deal of disk
space being wasted; please consider this when defining fields. Arcus
supports dBASE III, dBASE IV, FoxPro, Clipper, FoxBase and dBXL/QuickSilver
database file formats. Some formats permit larger field sizes and/or number
of fields per record, please see "¬Database Technical Information╪66897 ¬" for more
details about this.
N.B. To make a new database active you must next select it via the "Open
Database File" option.
|Index or Re-Index Database File|
You can use index files with all Arcus database files. Index files define how
you look at the records in your active database file. One database file may
have many index files so that you can look at records in different orders or
selectively. For example, if you had an age field in your database file you
could use an index file based on all records to display them in age order.
You could also select only those records falling within a certain age range.
Please note that you must renew the index file each time you edit the parent
database file; this is done via the create index file option. Once you have
created an index file you must use the "select index file" function in order
to make it active.
|Modify Database Structure|
This function enables you to add, remove, shorten, lengthen or rename the
fields of an existing database file. It then refills the redefined file
structure with the records from the original database file. Field data is
truncated or padded as necessary. Please be careful not to lose data by
imprudent use of this function. It is wise to make a copy of your original
database file via the "copy data..." option before experimenting with this
option. Arcus does, however, make a backup file (name.bak) of your original
database file (name.dbf) when performing this function.
|Pack or Purge Records|
This function will compact an existing database file so that it takes up less
disk space and can be read more efficiently. The purge procedure removes all
records which have been marked as deleted so please use it with caution.
|Print Report|
This option enables you to send selected database fields and records to a printer.
The target printer port and the number of lines per page are defined via the
setup menu in the main Arcus module.
|Database Technical Information|
¬<Record Selection Functions>╪69953 ¬
Limits
~~~~~~
Maximum file size = 4.2 billion bytes
Record limits depend upon the file type selected:
--> dBASE III/III+
max record length: 4095
max no of fields: 128
field types: character 1-254
numeric 1-19 (0 to 15 decimal places)
logical 1
date 8
memo 10
--> dBASE IV
max record length: 4000
max no of fields: 255
field types: character 1-254
numeric 1-20 (0 to field length-2 decimal places)
floating 1-20 (0 to field length-2 decimal places)
logical 1
date 8
memo 10
--> FoxPro 1.0/2.0
max record length: 4000
max no of fields: 255
field types: character 1-254
numeric 1-20 (0 to field length-2 decimal places)
floating 1-20 (0 to field length-2 decimal places)
logical 1
date 8
memo 10
--> Clipper '87 5.0
max record length: 8192
max no of fields: 1023
field types: character 1-2048
numeric 1-30 (0 to 13 decimal places)
logical 1
date 8
memo 10
--> dBXL, QuickSilver
max record length: 4000
max no of fields: 512
field types: character 1-254
numeric 1-19 (0 to 15 decimal places)
logical 1
date 8
memo 10
--> FoxBase 1.0/2.0
max record length: 4000
max no of fields: 128
field types: character 1-254
numeric 1-19 (0 to 15 decimal places)
logical 1
date 8
memo 10
Date entry
~~~~~~~~~~
Please note that ArcusDB stores date fields in the format YYYYMMDD without any
separators. This format is used in the browse & edit and print report sections.
The append records section, however, uses the DD/MM/YY(YY) format to accept
initial input of dates; the Arcus worksheet uses this date entry system also. ArcusDB
does NOT convert dates to Julian numbers for the database files but does convert
them to Julian numbers when you export them as an Arcus worksheet file as this
is the date storage format in the Arcus worksheet.
Using Arcus Database Manager Independently
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The database module is designed to be called from the Arcus data management
menu but it can be run without the main Arcus module. If you wish to run it
independently then you must supply some parameters at the command line:
ARCUSDB /?a/0/?b/?c/0/?d/ where ?a is the data storage path
(e.g. C:\ARCUS\DATA\ - NB do NOT forget the final backslash), ?b is the
printer port (e.g. 1), ?c is the number of lines which your printer fits on one
page (e.g. 64) and ?d is the mouse sensitivity (e.g. 30). This can be put into
a batch file. Further details are available on request.
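Using the example values given above, the complete line (which could be placed
in a batch file) would read as follows; this is an illustration only, assuming
the parameters are substituted directly into the template:
  ARCUSDB /C:\ARCUS\DATA\/0/1/64/0/30/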
|Record Selection Functions|
Where S = string, N = Numeric, L = logical, D = Date:
ABS(N) absolute value
{ABS(5-11) is 6 not -6}
ASC(S) ASCII value of first character
{ASC("Abacus") is 65}
AT(S1, S2) character position of S2 within S1
{AT("Hello World", "or") is 8}
CAPS(S) capitalise the first letter of each word
{CAPS("GOOD DAY") is "Good Day"}
CHR(N) ASCII character
{CHR(65) is "A"}
DATE$ current system date
DELETED() returns "T" if record is deleted
(i.e. asterisk as first character) and "F" if it is not
IIF(X1, X2, X3) returns X2 if X1 is true else returns X3
{IIF(AGE>=65, "Ger", "Med") is "Ger" for age=78}
INSTR(N, S1, S2) character position of S2 within S1 starting at N
{INSTR(5, "Hello World", "l") is 10}
INT(N) rounded down to the nearest integer
{INT(3.5) is 3}
JULIAN(D) returns the Julian number of the date;
this number is in character format
LEFT(S, N) left N characters of S
{LEFT("Pioneer", 2) is "Pi"}
LEN(S) length of S
{LEN("Pioneer") is 7}
LOWER(S) lower case
{LOWER("HELLO") is hello}
LPAD(S1, N, S2) pad S1 to N characters with S2 at the left
{LPAD("Hello", 12, "H") is "HHHHHHHHello"}
LTRIM(S) cut leading blanks
{LTRIM(" Here ") is "Here "}
MAX(N1, N2) maximum of N1 and N2
{MAX(21, 21.01) is 21.01}
MID$(S, N1, N2) extract N2 characters from S starting at N1
{MID$("Hello",2,1) is "e"}
MIN(N1, N2) minimum of N1 and N2
{MIN(21, 21.01) is 21}
RECNO() current record number
RECORD() full content of current record in one string
REPLICATE(S, N) N replicates of S
{REPLICATE(".", 3) is "..."}
RIGHT(S, N) right N characters of S
{RIGHT("Pioneer", 2) is "er"}
RPAD(S1, N, S2) pad S1 to N characters with S2 at the right
{LPAD("Hello", 12, "*") is "Hello*******"}
RTRIM(S) cut trailing blanks
{RTRIM(" Here ") is " Here"}
SPACE(N) N blanks
{SPACE(3) is " "}
STRING$(N, S/N) N repetitions of S or ASCII(N)
{STRING$(3, 88) is "XXX"}
STR(N1, N2, (N3)) string of N1 of length N2 with N3 decimal places
{STR(2.341, 6, 4) is 2.3410}
SUBSTR(S,N1,N2) extract N2 characters from S starting at N1
{SUBSTR("Hello",2,2) is "el"}
TIME() eight character string of current time
TRIM(S) strip leading and trailing blanks
{TRIM(" Here ") is "Here"}
UPPER(S) convert to upper case
{UPPER("Hello") is "HELLO"}
VAL(S) numerical value of string
{VAL("34") is 34.0}
Record Selection Operators
~~~~~~~~~~~~~~~~~~~~~~~~~~
= equal to
<> not equal to
< less than
> greater than
>= greater than or equal to
<= less than or equal to
Boolean Operators
~~~~~~~~~~~~~~~~~
.AND. true if both expressions are true
.OR. true if one expression is true
.NOT. opposite truth of expression
Concatenation Symbols
~~~~~~~~~~~~~~~~~~~~~
+ combine expression
- subtract expression
Literals
~~~~~~~~
[] explicit expression not a field name
Examples
~~~~~~~~
.NOT. PAID
- gives records where the logical field paid is false (F/N)
PAID .AND. AGE < 30
- gives records where the logical field paid is true (T/Y) for ages under 30
AGE >= 25 .AND. AGE < 30
- gives records with ages from 25 up to (but not including) 30 where age is a numeric field.
VAL(AGE) >= 25 .AND. VAL(AGE) < 30
- gives records with ages from 25 up to (but not including) 30 where age is a character field.
MID$(UPPER(NAME), 1, 1) = "A"
- gives records where name begins with the letter A.
ASC(MID$(UPPER(NAME), 1, 1)) >= 65 .AND. ASC(MID$(UPPER(NAME), 1, 1)) < 73
- gives records with names from A to H (see Appendix Three for ASCII codes).
|DATA-BASICS|
Think of a database as a section of a filing cabinet. The database manager
enables you to create a special form for that section and to control the data
which are contained in each form. A form contains one record. A record
contains pieces of information, such as name, age, sex etc., in separate boxes
called fields. The type of field depends upon the type of data it has been
designed to accept, e.g. 10 characters or a number with 3 decimal places. All
of this field information is defined when you create a new database file. The
resulting template is then used to admit information to successive records.
Arcus allows you to change this basic structure even after you have put
information into the database file.
The term "report" refers to information taken from the records in the database
file for inspection on screen or print-out. This information consists of the
fields and records which you specify. That brings us to another important term
"record selection". Arcus uses the dBASE language to define your conditions for
selecting the records which you want to look at. For example, you might want to
consider only those aged 65 or over. In this case you enter the selection term
as AGE >= 65 providing you have a field called AGE. These selection expressions
can be highly complex, for more details see ¬<record selection functions>╪69953 ¬.
You might have heard the term "relational database". This refers to the way in
which several sections of our filing cabinet communicate or "relate". Say we
had basic patient details in one section, information from a study in another,
and a link between the two. This link is a special field, such as case sheet
number, which is common to both sections/databases. The current Arcus database
manager does not provide relational operation. If you need this facility then
you should use a dedicated database management system and use Arcus database
manager as a link between this and the Arcus worksheet.
The last term we shall examine here is "index". An index is a file which keeps
track of the records in your database file. It enables you to specify an order
in which you wish to work with database records. This order refers to one field
e.g. surnames in alphabetical order. One database file can have many index
files so that you can look at the same database in different ways. If you alter
a database file in any way without having an index file open then the index
will have lost track of your database. You must, therefore, re-index the
database if you think changes have been made to the database file without
having the index file open.
|ANALYSIS|
¬<Worksheet oriented analysis>╪77578 ¬
¬<Instant functions>╪213040 ¬
The analytical functions of Arcus are divided into the two sections shown above.
Worksheet oriented functions require data which have been prepared using the
worksheet in the data management section of Arcus. Instant functions prompt
you for data when you select a function, e.g. a box to fill in a four fold
contingency table.
|Worksheet oriented analysis|
¬<Arithmetical Manipulation>╪78201 ¬
¬<Descriptive Statistics>╪80612 ¬
¬<Pictorial Statistics>╪81471 ¬
¬<Parametric Methods>╪87475 ¬
¬<Nonparametric Methods>╪98877 ¬
¬<Regression and Correlation>╪119789 ¬
¬<Analysis of Variance>╪158578 ¬
¬<Survival Analysis>╪182274 ¬
All of the analysis functions which do not require their data to have been
entered in the Arcus worksheet are described under "¬Instant Functions╪213040 ¬". Those
functions which do require previously entered data from the Arcus worksheet
are dealt with in this section.
|Arithmetical Manipulation|
This provides a selection of arithmetical treatments which can be applied to a
worksheet column (variable). For example you could apply the expression
V1 * (V1/SQR(V1)+2) to the variable V1 and the result of this equation for
each of the data in the V1 variable would be placed in a new variable. The
results are always stored in a new variable which you name, the data in the
source variable are never altered. Arcus Pro-Stat can interpret a wide range
of functions; these are identical to the cell editor functions which are
described in the ¬Arcus worksheet╪36264 ¬ section of this hypertext. You apply the
expression by entering it via the keyboard and you can call up a list of
allowable functions by pressing the F1 key when you are editing the expression.
There is also provision for you to create a new variable as a function of more
than one existing Arcus variable; V1, V2, V3 etc. For example, if you wanted
to create a column of electrical current values using Ohm's law (V = I * R)
you could select resistance as V1 and voltage as V2 then apply the expression
V2 / V1.
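As a sketch of this idea (illustrative Python, not Arcus code), applying the
Ohm's law expression above to two columns elementwise:
  resistance = [10.0, 22.0, 47.0]   # hypothetical V1 column (ohms)
  voltage    = [5.0, 12.0, 9.0]     # hypothetical V2 column (volts)

  # Apply the expression V2 / V1 to each row to obtain current in amps
  # (I = V / R), placing the results in a new column.
  current = [v / r for r, v in zip(resistance, voltage)]
  print(current)   # [0.5, 0.545..., 0.191...]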
¬<Other Transformations>╪79380 ¬
|Other Transformations|
Please note that logit, probit, angular and cumulative transformations
are listed in this section whereas ranks, sortings and normal scores can be
obtained via the nonparametric methods section. If you request probit, logit
or angular transformation for a set of discrete data then the result for each
data point will represent the transformation of the proportion (p) of the
maximum in the variable which that point comes from. Logit transformation is
defined as LOG(p/(1-p)) and provides a way of linearizing sigmoid distributions.
Probit transformation is defined as 5 + Z(1-p) and also provides a way of
linearizing sigmoid distributions. Angular transformation uses arcsin(√p);
this provides a way of linearizing sigmoid distributions and equalising
variances.
For Logit and Probit transformations indeterminable values (when p=0 or p=1)
are stored as missing data. The name for a variable resulting from one of these
transformations is the name of the source variable suffixed with ~Pr, ~Lo, ~Ag
or ~Cm as appropriate. N.B. - each time something computationally illegal,
such as the natural logarithm of zero, is requested then the result is stored
as the missing data value.
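A minimal Python sketch of these three transformations, using the definitions
above (NormalDist().inv_cdf plays the role of Z; the treatment of p=0 and p=1
as missing data follows the text):
  from math import log, asin, sqrt
  from statistics import NormalDist

  MISSING = None   # stand-in for the Arcus missing data value

  def logit(p):
      return log(p / (1 - p)) if 0 < p < 1 else MISSING

  def probit(p):
      # 5 + Z(1-p), with Z the standard normal quantile function.
      return (5 + NormalDist().inv_cdf(1 - p)) if 0 < p < 1 else MISSING

  def angular(p):
      return asin(sqrt(p))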
|Descriptive Statistics|
This option provides measures of location and dispersion which describe the
data in any variable. You are given the number, arithmetic mean, variance,
standard deviation, standard error of the arithmetic mean, confidence interval
for the arithmetic mean, geometric mean, coefficient of skewness, coefficient
of kurtosis, maximum, upper quartile, median, lower quartile, minimum and range
for each selected variable. You can also choose to calculate any additional
quantile and this is appended to the results listed above. Incalculable results
are displayed as missing data using an asterisk (*). Arcus uses Kendall's
definitions of skewness and kurtosis (ref 7). The relative merits of these
descriptive methods are presented clearly and concisely in Aviva Petrie's
book (ref 1).
¬<reference list>╪310584 ¬
|Pictorial Statistics|
¬<Histogram>╪82408 ¬
¬<Box and Whisker Plot>╪83131 ¬
¬<Scatter Plot>╪83837 ¬
¬<Normal Plot>╪84456 ¬
¬<Survival Plot>╪85093 ¬
¬<Error Bar Plot>╪85626 ¬
¬<Spread Plot>╪86257 ¬
¬<Ladder Plot>╪86870 ¬
You can describe and relate your data graphically using these functions.
Neat scales are chosen automatically for each function and the figure is
composed using standard ASCII text characters or graphics images. High quality
presentation graphics output can be obtained from the graphics functions when
you are using a PostScript printer. Please note that you can also export
PostScript images for use in most good word processing software; however the
target printer must also be PostScript compatible. Printing is activated by
pressing P when the figure is displayed. You can annotate the ASCII graphics
before sending them to a printer or to a log file.
|Histogram (ASCII)|
The frequency distribution histograms are plotted horizontally across the screen
with the count for each division displayed at the right hand side. This function
divides your variable into x ranges between the minimum and maximum value of the
selected variable. You specify x. Arcus then selects a "neat" set of midpoints
for these ranges and draws horizontal bars to represent the number of data in
the variable which fall into each of these ranges. For fewer than 64 data
points per bar each asterisk (*) represents one count; above this value the
bars are proportional representations but their true values can be gleaned
from the counts display at the right hand side of the screen.
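A minimal sketch of such an ASCII histogram in Python (the equal ranges and
one-asterisk-per-count rule follow the description above; the "neat" midpoint
selection performed by Arcus is not reproduced):
  def ascii_histogram(data, x):
      lo, hi = min(data), max(data)
      width = (hi - lo) / x
      counts = [0] * x
      for d in data:
          # Place each datum in one of x equal ranges between min and max.
          i = min(int((d - lo) / width), x - 1)
          counts[i] += 1
      for i, c in enumerate(counts):
          mid = lo + (i + 0.5) * width
          print(f"{mid:8.2f} |{'*' * c} {c}")

  ascii_histogram([1, 2, 2, 3, 3, 3, 4, 7], 3)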
|Box and Whisker Plot|
Box and Whisker plots, described by Tukey (1977), give you a pictorial
representation of the nonparametric descriptive statistics. In Arcus Pro-Stat,
the "box" bounded by parentheses represents the distance between the first and
third quartiles with the median between them marked by an asterisk (*), with the
minimum as the origin of the leading "whisker" and with the maximum as the limit
of the trailing "whisker". This is a very good way of showing an audience the
spread of your data, it is much easier to convey than a dry list of
nonparametric descriptive statistics. The graphics based version of this plot
is intended for PostScript presentation graphics.
|Scatter Plot (ASCII) & (Graphic)|
This function plots a Y axis (ordinate) variable against an X axis (abscissa)
variable. The scale selection for the axes is automatic. Superimposed plot
points are displayed as the number of plot points at one screen location
provided this number is less than 10. If more than 9 plot points lie at one
screen location then it is marked with the letter X. The graphics based version
of this function allows you to display up to four series which are displayed
using different marker styles for each series and you can opt to display
joining lines between the markers.
|Normal Plot (ASCII)|
The normal plot uses the same physical plotting procedures as the ASCII text
based scattergram but you select only one variable which is plotted against
its normal scores. Normal scores are calculated as Z((2k-1)/2n) where k is the
rank of a datum in your variable, n is the number of data and Z is a quantile
from the standard normal distribution. The linearity of the resultant plot
indicates the normality of the distribution of the data in your selected
variable. For a more objective assessment of normality please use the
Shapiro-Wilk W test which is listed in the parametric methods section.
|Survival Plot|
This provides a graphics based step plot for displaying survival curves. It is
intended to be used with variables for Time on the X axis and S (the Kaplan-
Meier survivor function) on the Y axis. You can use up to four series and
high quality output is available via a PostScript printer. This is a good
accompaniment to a presentation of survival analysis which compares survival
(or time to event) data in different groups. Please see ¬Kaplan-Meier╪182964 ¬ for
more information on generating S.
|Error Bar Plot|
The high-low-close plots of business graphics packages can be difficult to
manipulate if you have to display more than one series; therefore, I have
included this function in Arcus. You can use up to four series for which you
must provide three variables for each series; the X data, the Y data and the
error function of the Y data. The error function can be, for example, the
standard error of the mean for each Y when each Y point represents the mean of
repeated observations. Different series are represented by different marker
styles and you can opt to show joining lines between the markers.
|Spread Plot|
This is a very useful way of presenting the spread of data in up to four
groups. It is one step back from the Box & Whisker plot in that it gives
an entirely pictorial representation of the spread of your data. The axis is
divided into an arbitrary number of divisions which are the width of a plot
point; if more than one datum occupies a division it is plotted alongside the
first, thus a concentration of data at a particular value is represented by a
broad band. I liken this to a "statistical electrophoresis". High quality
output is available when using a PostScript printer.
|Ladder Plot|
Arcus provides a ladder plot for the comparison of paired data from two groups.
This is a useful pictorial accompaniment to paired t and Wilcoxon signed ranks
tests when the number of pairs is not too large. Each pair is joined by a line;
these lines would look like the parallel rungs of a ladder if there were little
difference between each pair. A presentation of continuous observations from
a small to medium sized population before and after an intervention is
conveniently represented by a ladder plot. High quality output is available
when using a PostScript printer.
|Parametric Methods|
¬<Tests using Student's t>╪87973 ¬
¬<Z (Normal distribution) tests>╪96200 ¬
¬<F (variance ratio) test>╪95622 ¬
¬<Shapiro-Wilk test for normality>╪97214 ¬
This section provides various hypothesis tests and descriptive functions which
assume that your data come from a normal distribution. The Shapiro-Wilk W test
is, strictly speaking, a nonparametric method but it is included in this section
because it enables you to test for "non-normality".
|Tests using Student's t|
¬<Paired t test>╪88342 ¬
¬<Single sample t test>╪91441 ¬
¬<Unpaired (two sample) t test>╪93086 ¬
Please note that Student t tests calculated directly from numbers, means and
standard deviations, instead of from worksheet columns, are given in the
Student's t distribution section of the instant functions module.
|Paired t test|
The paired t test provides an hypothesis test of the difference between
population means for a pair of random samples whose differences are from an
approximately normal distribution. A confidence interval is provided for the
difference between the means and the limits of agreement are given (ref 4, 5).
EXAMPLE: Comparison of peak expiratory flow rate before and after a walk on a
cold winter's day for a random sample of 9 asthmatics. You enter two columns
in the worksheet, one of PEFR's before the walk and the other of PEFR's after
the walk. In this example each row must represent the same subject, in other
studies the data might be matched / paired in some other way.
subject before after
1 312 300
2 242 201
3 340 232
4 388 312
5 296 220
6 254 256
7 391 328
8 402 330
9 290 231
If you were to plot these pairs using a ladder plot you would see that all but
one pair decreases. You might also wish to test the assumption that the
differences are from a normal distribution; this can be done with the
Shapiro-Wilk test. If you want to create a separate column of differences then
press Alt+Q in the worksheet to create a new column as "after-before".
Then select this function with say 95% confidence level when prompted. The
results screen will show you p values and the confidence interval for the
difference between the means.
For our example:
Mean of differences = 56.1
95% CI for difference between means = 29.8 to 82.4
two tailed p = 0.0012 **
A null hypothesis of no difference between the means is clearly rejected because
the confidence interval does not include zero.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
Considering other studies where the two groups represent two different ways of
measuring the same thing or two different observers, you might be interested in
the limits of agreement. These limits are displayed on the standard paired t
test results screen and an agreement plot is given after each paired t test.
These only apply to agreement studies. When two methods of measurement are
being compared it is almost always erroneous to present a scatter plot with
correlation as a measure of agreement between the paired data obtained using the
two methods of measurement. Highly correlated results often agree poorly;
indeed, large shifts in measurement scales may leave the correlation coefficient
unaltered. It is therefore necessary to provide a quantification of agreement.
This is done by use of the paired t-test and limits of agreement. Arcus allows
you to select a confidence level for limits of agreement and provides an ASCII
plot of the difference against the mean for each pair of measurements. This
plot also displays the overall mean difference bounded by the limits of
agreement. A good review of this subject has been provided by Martin Bland and
Doug Altman (ref 29, 5).
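As a cross-check of the worked example above, a minimal sketch in Python using
scipy (the confidence interval is built from the t quantile; the limits of
agreement follow the usual Bland and Altman form, mean difference plus or
minus 1.96 standard deviations of the differences, which is assumed here to be
the form Arcus uses):
  from scipy import stats

  before = [312, 242, 340, 388, 296, 254, 391, 402, 290]
  after  = [300, 201, 232, 312, 220, 256, 328, 330, 231]
  diffs  = [b - a for b, a in zip(before, after)]

  n = len(diffs)
  mean = sum(diffs) / n
  sd = (sum((d - mean) ** 2 for d in diffs) / (n - 1)) ** 0.5
  se = sd / n ** 0.5

  t, p = stats.ttest_rel(before, after)   # two tailed p, about 0.0012
  tq = stats.t.ppf(0.975, n - 1)
  print(mean - tq * se, mean + tq * se)   # about 29.8 to 82.4, as above

  z = stats.norm.ppf(0.975)               # limits of agreement
  print(mean - z * sd, mean + z * sd)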
|Single sample t test|
The single sample t method tests the null hypothesis that the population mean
is equal to a specified value. If this value is zero then the confidence
interval for the sample mean is given (ref 4, 5).
EXAMPLE: Consider 20 first year resident doctors drawn at random from a
regional health authority, resting systolic blood pressures measured using an
electronic sphygmomanometer were:
128 127
118 115
144 142
133 140
132 131
111 132
149 122
139 119
136 129
126 128
From previous large studies of "healthy" individuals drawn at random from the
general public (with the same male:female ratio) a resting systolic blood
pressure of 120 mm Hg was predicted as the age matched population mean. To
analyse these data in Arcus first prepare a worksheet column containing all 20.
Then select the single sample t test from the parametric methods menu of the
analysis section. Enter your population mean as 120; then run the test again
without entering a population mean to obtain the confidence interval for the
sample mean.
For our example:
sample mean = 130
95% CI for difference between means (i.e. sample-population) = 5.4 to 14.7
95% CI for sample mean = 125.4 to 134.7
two tailed p = 0.0002 ***
A null hypothesis of no difference between sample and population means has
clearly been rejected. Using the 95% CI we expect the mean systolic BP for
this population of doctors to be at least 5 mm Hg greater than the age and
sex matched general public, lying somewhere between 125 and 135 mm Hg.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
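A brief scipy sketch of this test, using the data above (illustrative only):
  from scipy import stats

  bp = [128, 118, 144, 133, 132, 111, 149, 139, 136, 126,
        127, 115, 142, 140, 131, 132, 122, 119, 129, 128]
  t, p = stats.ttest_1samp(bp, popmean=120)
  print(p)   # two tailed p of about 0.0002, as quoted above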
|Unpaired (two sample) t test|
The unpaired t method tests the null hypothesis that the population means
relating to two independent, random samples from an approximately normal
distribution are equal (ref 4, 5). A confidence interval is constructed for
the difference between population means. This test must not be used if there
is a significant difference between the variances of the two samples; this is
tested for and you are given appropriate warnings. There are parametric
alternatives which have been designed to cope with the situation of unequal
variances, namely the methods due to Behrens and Welch, but the nonparametric
Mann-Whitney test is more robust.
EXAMPLE (from Armitage, ref 4 p 109): Consider the gain in weight of 19 female
rats between 28 and 84 days after birth. 12 were fed on a high protein diet
and 7 on a low protein diet:
High Protein Low Protein
134 70
146 118
104 101
119 85
124 107
161 132
107 94
83
113
129
97
123
To analyse these data in Arcus first prepare them in two worksheet columns and
label these columns appropriately. Then select the unpaired t test from the
parametric methods menu of the analysis section. Request a 95% confidence
interval (CI) by pressing the enter key when prompted.
For this example:
mean of "High Protein" = 120 g
mean of "Low Protein" = 101 g
difference between sample means = 19
95% CI for difference between population means = -2.2 to 40.2
two tailed p = 0.07
Thus we have a difference which is not quite significant at the 5% level. The
most important information is, however, conveyed by the CI. The 95% CI includes
zero, therefore we cannot be confident (at the 95% level) that these data show
any difference in weight gain. As most of the interval is toward weight gain
and as the test result is in the grey "suggestive" 5%-10% zone we have good
evidence for repeating this experiment with larger numbers. Bigger samples
will probably shrink the range of uncertainty so that the CI contracts to a
narrower band clearly above zero.
NB We did not consider a one tailed p here because we could not be absolutely
certain that the rats would all benefit from a high protein diet in comparison
with those on a low protein diet. They might have suffered adverse effects
from our high protein diet.
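A brief scipy sketch of this test with the data above (illustrative only; the
equal variance form of the test is used, matching the assumption in the text):
  from scipy import stats

  high = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
  low  = [70, 118, 101, 85, 107, 132, 94]
  t, p = stats.ttest_ind(high, low)
  print(p)   # two tailed p of about 0.07, as quoted above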
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|F (variance ratio) Test|
This tests the equality of two variances from random samples which are
approximately normally distributed. Only the upper tail probability need be
considered because the larger variance is always used as the numerator in
Snedecor's variance ratio F (ref 4, 5). In most situations this probability
should be doubled to give a two tailed test. Analysis of variance can utilise
a one tailed probability because the numerator and denominator of the variance
ratio are predetermined.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Z (normal distribution) Test|
For large (n >= 50) normally distributed samples you can use this sensitive
method which is equivalent to the single sample and unpaired t tests. You may
either compare two independent random variables or compare the data in a variable
with a known population mean. Remember that with large degrees of freedom a t
distribution is approximately normal (ref 4, 5).
EXAMPLE: See the examples for t tests and consider these in the context of
larger samples.
You will gain a little more sensitivity by using the normal distribution tests
but you must have good reason to believe that your data have been drawn from a
normal distribution. The t tests are less sensitive to small deviations from
normality, so use them instead if you have any doubt. If your data are clearly
non-normal then you must use one of the nonparametric methods even if you have
large samples.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Shapiro-Wilk test for non-normality|
This test is a complex analysis of variance which can be used to test a variable
for the non-normality of its data. There must be a random sample of between 3
and 2000 data. The null hypothesis of the test is that the sample is taken from
a normal distribution, thus a p value of less than 0.05 rejects this
supposition of normality. You should not use any of the parametric methods
with variables for which W is significant. Most authors agree that this is the
most reliable quantification of normality for small to medium sample sizes
(ref 6, 21, A17, A18).
EXAMPLE (Shapiro & Wilk ref 21): Consider the following 30 penicillin yields:
0.0958 0.0002
0.0333 -0.0026
0.0293 -0.0036
0.0246 -0.0042
0.0206 -0.0113
0.0194 -0.0139
0.0191 -0.0211
0.0182 -0.0333
0.0173 -0.0341
0.0132 -0.0363
0.0102 -0.0363
0.0084 -0.0402
0.0077 -0.0582
0.0058 -0.1184
0.0016 -0.1398
To test these data for non-normality using Arcus you must first prepare them in
a worksheet column. Then select the Shapiro-Wilk test from the parametric
methods menu of the analysis section.
Here the test statistic was clearly significant at p = 0.002 which rejects the
null hypothesis that these data are from a normal distribution. In fact these
data were from a 2 by 5 factor grouping experiment.
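A brief scipy sketch with the same data (illustrative only; small differences
from the quoted p value may arise from differences in the W approximation):
  from scipy import stats

  yields = [0.0958, 0.0333, 0.0293, 0.0246, 0.0206, 0.0194, 0.0191, 0.0182,
            0.0173, 0.0132, 0.0102, 0.0084, 0.0077, 0.0058, 0.0016, 0.0002,
            -0.0026, -0.0036, -0.0042, -0.0113, -0.0139, -0.0211, -0.0333,
            -0.0341, -0.0363, -0.0363, -0.0402, -0.0582, -0.1184, -0.1398]
  w, p = stats.shapiro(yields)
  print(w, p)   # p should be close to the 0.002 quoted above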
N.B. Do NOT use this test to say that your data are "normally distributed";
this is quite wrong! The Shapiro-Wilk test provides evidence for certain
types of "non-normality"; it does NOT guarantee "normality".
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Nonparametric Methods|
¬<Mann-Whitney test>╪100302 ¬
¬<Wilcoxon's signed ranks test>╪102924 ¬
¬<Spearman's rank correlation>╪105582 ¬
¬<Kendall's rank correlation>╪107776 ¬
¬<Cuzick's test for trend>╪110163 ¬
¬<Two sample Smirnov test>╪112415 ¬
¬<Quantile confidence interval>╪114119 ¬
¬<Save ranked data>╪115938 ¬
¬<Save sorted data>╪117024 ¬
¬<Save normal scores>╪118837 ¬
This section provides various hypothesis tests and descriptive functions which
do not assume that your data are taken from normal distributions. When you
have few data or there is doubt about their distribution then you should err on
the side of caution and use nonparametric methods. These methods are usually
less sensitive than their parametric counterparts but they are more robust. The
numerical methods involved in these rank based calculations have progressed in
the last few years and Arcus Pro-Stat utilises the most modern developments,
including some calculations of exact probability in the presence of tied data.
An excellent account of nonparametric methods is given by Conover (ref 6).
In addition to the rank based tests below you can use three functions in this
section to save the ranks, sorted data or normal scores of a variable into a
new variable. The name of this new variable is the name of the source variable
prefixed with Rk~, Sr~ or Ns~ as appropriate.
|Mann-Whitney test| / Wilcoxon Rank Sum Test
This is a distribution free method for the comparison of two independent random
samples which have been measured using a scale that is at least ordinal. Arcus
uses the sampling distribution of U to give exact probabilities. This can take
a long time when there are tied data so please do not think that your computer
has crashed. Confidence intervals are constructed for the difference between
the two population means. The level of confidence used is as close as possible
to that which you have selected. Arcus approaches the selected confidence level
from the conservative side. When samples are large a normal approximation is
used for the hypothesis test and for the confidence interval (ref 6, A6, A19,
A20).
EXAMPLE: (from Conover ref 6 p 218) The following data represent fitness scores
from two groups of boys of the same age, those from homes in the town and those
from farm homes:
Farm Boys Town Boys
14.8 10.6 12.7 16.9 7.6 2.4 6.2 9.9
7.3 12.5 14.2 7.9 11.3 6.4 6.1 10.6
5.6 12.9 12.6 16.0 8.3 9.1 15.3 14.8
6.3 16.1 2.1 10.6 6.7 6.7 10.6 5.0
9.0 11.4 17.7 5.6 3.6 18.6 1.8 2.6
4.2 2.7 11.8 5.6 1.0 3.2 5.9 4.0
To analyse these data in Arcus you must first enter them in two separate
worksheet columns. Then select the Mann-Whitney test from the nonparametric
methods menu of the analysis section. Press enter when prompted for confidence
interval specifications, this accepts the default 95% level.
For this example:
difference between sample medians = 0.8
two tailed p = 0.53
95.1% CI for difference between population means = -2.4 to 4.4
Here we have assumed that these groups are independent and that they represent
at least hypothetical random samples of their sub-populations. In
this analysis we clearly have to accept the null hypothesis that one group does
NOT tend to yield different fitness scores to the other. The extent of this
lack of difference is shown by zero being contained well within the confidence
interval for the difference between population means. Note that the quoted
95.1% confidence interval is as close as you can get to 95% because of the very
nature of the mathematics involved in nonparametric methods like this.
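A brief scipy sketch with the same data (illustrative only; the ties force an
approximation, so the p value may differ slightly from the exact 0.53 above):
  from scipy import stats

  farm = [14.8, 10.6, 12.7, 16.9, 7.3, 12.5, 14.2, 7.9, 5.6, 12.9, 12.6,
          16.0, 6.3, 16.1, 2.1, 10.6, 9.0, 11.4, 17.7, 5.6, 4.2, 2.7,
          11.8, 5.6]
  town = [7.6, 2.4, 6.2, 9.9, 11.3, 6.4, 6.1, 10.6, 8.3, 9.1, 15.3, 14.8,
          6.7, 6.7, 10.6, 5.0, 3.6, 18.6, 1.8, 2.6, 1.0, 3.2, 5.9, 4.0]
  u, p = stats.mannwhitneyu(farm, town, alternative="two-sided")
  print(p)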
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Wilcoxon's Signed Ranks| (matched pairs) test
This is a nonparametric method for the comparison of a pair of samples whose
component data have differences which are from a symmetrical distribution.
A two tailed test uses the null hypothesis that the common median of the
differences is zero. A confidence interval is constructed for the difference
between the population medians. The sum of the ranks for the positive
non-zero differences is given and the exact permutational probability
associated with this value is calculated for sample sizes of less than 30.
A normal approximation is used with sample sizes of 30 or more and when there
are ties. Please note that some statistical software uses an old approximation
formula which is inappropriate in the presence of ties. Conover (ref 6) states
that in the presence of ties the test statistic must be the sum of the signed
ranks divided by the square root of the sum of their squares. You may be
familiar with the old method
of using the smaller sum of ranks in one direction but this is not appropriate
with tied data. Confidence limits are calculated using critical values for k
with sample sizes up to 30 or by calculating K* for samples with more than 30
observations (ref 6, A20).
EXAMPLE (from Conover ref 6 p 283): The following data represent aggressivity
scores for 12 pairs of monozygotic twins:
Firstborn: 86 71 77 68 91 72 77 91 70 71 88 87
Second Twin: 88 77 76 64 96 72 65 90 65 80 81 72
To analyse these data in Arcus you must first enter them into two columns in the
worksheet. Then select Wilcoxon's signed ranks test from the nonparametric
methods menu of the analysis section. Select a 95% confidence interval by
pressing enter when prompted by the confidence interval menu.
For this example:
two tailed p = 0.45
median difference = 1.5
95.8% CI for the difference between population medians = -2.5 to 6.5
Assuming that the paired differences come from a symmetrical distribution then
these results show that one group did not tend to yield different results to
the other group which was paired with it, i.e. there was no statistically
significant difference between the aggressivity scores of the firstborn as
compared with the second twin. The extent of this lack of difference is shown
well by the confidence interval which clearly encompasses zero. Note that the
quoted 95.8% confidence interval is as close as you can get to 95% because of
the very nature of the mathematics involved in nonparametric methods like this.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Spearman's Rank Correlation|
This is a distribution free test of independence between two variables. It is,
however, insensitive to some types of dependence. Kendall's tau gives a much
better measure of correlation and is also a better test for independence in the
two tailed setting. Spearman's rank correlation coefficient (rho) is given to
six decimal places. The probability associated with rho is evaluated using a
recurrence method when n < 7 and the Edgeworth series expansion when n >= 7
(ref A13). A confidence interval for rho is constructed using Fisher's Z
transformation (ref 6, 11, 15).
EXAMPLE (from Conover ref 6 p 283): The following data represent aggressivity
scores for 12 pairs of monozygotic twins:
Firstborn: 86 71 77 68 91 72 77 91 70 71 88 87
Second Twin: 88 77 76 64 96 72 65 90 65 80 81 72
To analyse these data in Arcus you must first enter them into two columns in the
worksheet. Then select Spearman's rank correlation from the nonparametric
methods menu of the analysis section. Select a 95% confidence interval by
pressing enter when prompted by the confidence interval menu.
For this example:
rho = 0.74
95% CI for rho = 0.28 to 0.92
two tailed p = 0.0082 **
Here we have clearly rejected the null hypothesis of mutual independence
between the agressivity scores of pairs of twins. With a two tailed test we
are considering the possibility of a positive or a negative correlation, i.e.
we can't be sure of this direction at the outset. A one tailed test would have
been restricted to correlation in one direction only i.e. big values of one
group associated with big values of the other (positive correlation) or big
values of one group associated with small values of the other (negative
correlation). In our example we can conclude that there is a statistically
significant lack of independence between aggressivity scores of these twins.
We could then go on to speculate that aggressivity had an inherited component,
especially if these twins were brought up by different families.
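A sketch of this analysis in Python, with the Fisher Z confidence interval for
rho built as described above (a variance of 1/(n-3) is assumed for the
transformed coefficient; the p value from scipy may differ slightly from the
Edgeworth based figure quoted above):
  from math import atanh, tanh, sqrt
  from scipy import stats

  first  = [86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]
  second = [88, 77, 76, 64, 96, 72, 65, 90, 65, 80, 81, 72]

  rho, p = stats.spearmanr(first, second)
  half = stats.norm.ppf(0.975) / sqrt(len(first) - 3)
  lo, hi = tanh(atanh(rho) - half), tanh(atanh(rho) + half)
  print(rho, lo, hi)   # rho about 0.74, CI about 0.28 to 0.92, as above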
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Kendall's Rank Correlation|
Spearman's rank correlation is satisfactory for testing a null hypothesis of
independence between two variables but it is difficult to interpret when the
null hypothesis is rejected. Kendall's rank correlation improves upon this by
reflecting the strength of the dependence between the variables being compared.
Arcus gives you the directional change statistics and the test statistic tau.
In the presence of ties the test statistic tau b is given (as Kendall 1970).
A normalised statistic (Z) is also given (continuity corrected and uncorrected)
with associated probability and this is adjusted, using the full variance
formula, in the presence of ties. In the absence of ties the probability
associated with S (and thus tau) is evaluated using a recurrence formula when
n < 9 and the Edgeworth series expansion when n >= 9 (ref A14). In the presence
of ties you must accept the normal approximation (ref 6, 15).
EXAMPLE (from Conover ref 6 p 283): The following data represent aggressivity
scores for 12 pairs of monozygotic twins:
Firstborn: 86 71 77 68 91 72 77 91 70 71 88 87
Second Twin: 88 77 76 64 96 72 65 90 65 80 81 72
To analyse these data in Arcus you must first enter them into two columns in the
worksheet. Then select Kendall's rank correlation from the nonparametric
methods menu of the analysis section.
For this example:
tau = 0.56
continuity corrected two tailed p = 0.0136 *
Here we have clearly rejected the null hypothesis of mutual independence
between the agressivity scores of pairs of twins. With a two tailed test we
are considering the possibility of a positive or a negative correlation, i.e.
we can't be sure of this direction at the outset. A one tailed test would have
been restricted to correlation in one direction only i.e. big values of one
group associated with big values of the other (positive correlation) or big
values of one group associated with small values of the other (negative
correlation). In our example we can conclude that there is a statistically
significant lack of independence between aggressivity scores of these twins.
We could then go on to speculate that aggressivity had an inherited component,
especially if these twins were brought up by different families.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Cuzick's Test for Trend|
This provides a Wilcoxon-type test for trend across a group of three or more
independent randomly sampled variables. The component data must be at least
ordinal and the groups must be selected in a meaningful, ordered sequence. A
logistic distribution is assumed for errors. If you do not choose to enter your
own group scores then scores are allocated uniformly (1 ... n) in order of
selection of the n groups. For the null hypothesis of no trend across the
groups T will have mean ET, variance VarT and the null hypothesis is tested
using the normalised test statistic Z. Probabilities for Z are derived from
the standard normal distribution. Please note that this test is more powerful
than the application of the Wilcoxon rank-sum / Mann-Whitney test between
more than two groups of data (ref 28).
EXAMPLE (from Cuzick's paper ref 28): Mice were inoculated with cell lines,
CMT 64 to 181, which had been selected for their increasing metastatic
potential. The number of lung metastases found in each mouse after inoculation
are quoted below:
CMT 64 0, 0, 1, 1, 2, 2, 4, 9
CMT 167 0, 0, 5, 7, 8, 11, 13, 23, 25, 97
CMT 170 2, 3, 6, 9, 10, 11, 11, 12, 21
CMT 175 0, 3, 5, 6, 10, 19, 56, 100, 132
CMT 181 2, 4, 6, 6, 6, 7, 18, 39, 60
To analyse these data in Arcus you must first enter them in five worksheet
columns labelled appropriately. Then select Cuzick's test for trend from the
nonparametric methods menu of the analysis section. Just press N when you
are asked if you want to enter group scores; entering your own scores is
unnecessary for most analyses provided you select the variables in the order
you are studying them. With automatic group scoring you must be careful to
select the variables in the order across which you want to look for trend.
For this example:
one tailed p (corrected for ties) = 0.017 *
With these data we started out expecting a trend in one direction only,
therefore we can use a one tailed test for trend. We have shown a statistically
significant trend of increasing numbers of metastases across these malignant
cell lines in this order.
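A sketch of the statistic in Python, following the definitions above (T, ET
and VarT as in Cuzick's paper, ref 28; no tie correction is applied here, so
the p value will differ slightly from the tie corrected 0.017 quoted above):
  from statistics import NormalDist

  groups = [
      [0, 0, 1, 1, 2, 2, 4, 9],                # CMT 64
      [0, 0, 5, 7, 8, 11, 13, 23, 25, 97],     # CMT 167
      [2, 3, 6, 9, 10, 11, 11, 12, 21],        # CMT 170
      [0, 3, 5, 6, 10, 19, 56, 100, 132],      # CMT 175
      [2, 4, 6, 6, 6, 7, 18, 39, 60],          # CMT 181
  ]
  scores = range(1, len(groups) + 1)           # uniform scores 1 ... n

  pooled = sorted(d for g in groups for d in g)
  def midrank(x):                              # midranks for tied data
      lo = pooled.index(x)
      hi = len(pooled) - pooled[::-1].index(x)
      return (lo + 1 + hi) / 2

  N = len(pooled)
  T = sum(l * sum(midrank(d) for d in g) for l, g in zip(scores, groups))
  L = sum(l * len(g) for l, g in zip(scores, groups))
  ET = L * (N + 1) / 2
  VarT = (N + 1) / 12 * (N * sum(l * l * len(g)
         for l, g in zip(scores, groups)) - L * L)
  Z = (T - ET) / VarT ** 0.5
  print(1 - NormalDist().cdf(Z))               # one tailed p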
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Two Sample Smirnov Test|
Where you have two independent samples which have been drawn from possibly
different populations then you might consider looking for differences between
them using a t test or Mann-Whitney test. These tests are sensitive to
differences between two means or medians but do not consider other differences
such as variance. The two sample Smirnov method tests the null hypothesis that
the distribution functions of the populations from which your samples have been
drawn are identical. The test assumes that you have random samples which are
mutually independent. The measurement scale must be at least ordinal but for
an exact test you should use continuous data.
EXAMPLE (from Conover ref 6 p 370):
X: 7.6 8.4 8.6 8.7 9.3 9.9 10.1 10.6 11.2
Y: 5.2 5.7 5.9 6.5 6.8 8.2 9.1 9.8 10.8 11.3 11.5 12.3 12.5 13.4 14.6
To analyse these data in Arcus you must first enter them into two worksheet
columns and label them appropriately. Then select the two sample Smirnov test
from the nonparametric methods section of the analysis section.
For this example:
two sided p = 0.26
Thus we can not reject the null hypothesis that the two populations from which
our samples were drawn have the same distribution function.
If we were interested in a one sided test then we would need good reason for
expecting one group to yield values above (distribution shifted to the right of)
or below (distribution shifted to the left of) the other group. For these data
neither of the one tailed tests reached significance.
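A brief scipy sketch with the same data (illustrative only; the Smirnov test
is the two sample Kolmogorov-Smirnov test):
  from scipy import stats

  x = [7.6, 8.4, 8.6, 8.7, 9.3, 9.9, 10.1, 10.6, 11.2]
  y = [5.2, 5.7, 5.9, 6.5, 6.8, 8.2, 9.1, 9.8, 10.8,
       11.3, 11.5, 12.3, 12.5, 13.4, 14.6]
  d, p = stats.ks_2samp(x, y)
  print(d, p)   # two sided p should be close to the 0.26 quoted above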
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Quantile Confidence Intervals|
This selection from the nonparametric methods menu provides a confidence
interval for any quantile. As with all nonparametric confidence intervals, the
exact confidence level is not always attainable but the level which is exact
to the interval constructed is displayed (ref 6,11). Arcus approaches the
confidence interval from the conservative side, i.e. if the nearest levels to
95% are 94.4% and 95.9% then the latter will be chosen. For sample sizes
greater than 30 a reliable approximation based on the central limit theorem is
used (ref 6). A presentation of medians and their confidence intervals is often
more meaningful than the time honoured (abused) tradition of presenting means
and standard deviations. A box and whisker plot is a useful accompaniment to
this function.
EXAMPLE (from Conover ref 6 p 113): The following represent times to failure
in hours for a set of pentode radio valves:
46.9 56.8 63.3 67.1
47.2 59.2 63.4 67.7
49.1 59.9 63.7 73.3
56.5 63.2 64.1 78.5
To analyse these data in Arcus you must first enter them into a worksheet
column and label it appropriately. Then select the quantile confidence interval
from the nonparametric methods section of the analysis section. For a 90%
confidence interval select the 90% button from the confidence interval menu.
Then enter 0.75 to specify that the quantile you want is the upper quartile or
75th percentile.
For this example:
upper quartile = 66.35
90% confidence interval = 63.3 to 73.3
exact confidence level = 90.94%
We may conclude that with 91% confidence the population value of the upper
quartile lies between 63.3 and 73.3 hours.
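A sketch of the underlying order statistic argument in Python (the
conservative search over binomial coverage follows the description above):
  from scipy.stats import binom

  data = sorted([46.9, 56.8, 63.3, 67.1, 47.2, 59.2, 63.4, 67.7,
                 49.1, 59.9, 63.7, 73.3, 56.5, 63.2, 64.1, 78.5])
  n, q, wanted = len(data), 0.75, 0.90

  # Find the closest pair of order statistics (r, s) whose exact coverage,
  # P(X(r) <= population quantile <= X(s)), is at least 90%.
  best = None
  for r in range(1, n):
      for s in range(r + 1, n + 1):
          cover = binom.cdf(s - 1, n, q) - binom.cdf(r - 1, n, q)
          if cover >= wanted and (best is None or s - r < best[0]):
              best = (s - r, r, s, cover)

  _, r, s, cover = best
  print(data[r - 1], data[s - 1], cover)   # 63.3, 73.3, about 0.9094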
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Save Ranked Data|
This function enables you to save the ranks of a worksheet variable into a new
variable. The name of this new variable is the name of the source variable
prefixed with Rk~. You can choose to calculate a correction factor for ties in
the ranking. Four formulae are offered for tie correction:
1. Σ(t³ - t) / 12
2. Σ(t * (t-1)) / 2
3. Σ(t * (t-1) * (2t+5))
4. Σ(t * (t-1) * (t-2))
...where t is the number of data tied at each tie and upper case sigma (Σ)
is the summation across these ties.
EXAMPLE: Ranking the following aggressivity scores for a sample of firstborn
twins gives:
First Born -----> Rk~First Born (Ranks)
86 8
┌─71 3.5
│ 77──────┐ 6.5
│ 68 │ 1
│ 91─┐ ├tie 11.5
tie┤ 72 ├tie │ 5
│ 77─│────┘ 6.5
│ 91─┘ 11.5
│ 70 2
└─71 3.5
88 10
87 9
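A sketch of midranking and the first tie correction above, in Python
(illustrative, not Arcus code):
  from collections import Counter

  def midranks(data):
      # Tied values share the mean of the ranks they would otherwise occupy.
      s = sorted(data)
      return [(s.index(x) + 1 + len(s) - s[::-1].index(x)) / 2 for x in data]

  first_born = [86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]
  print(midranks(first_born))
  # [8, 3.5, 6.5, 1, 11.5, 5, 6.5, 11.5, 2, 3.5, 10, 9]

  # Tie correction formula 1: Σ(t³ - t) / 12 over each group of t ties.
  tie1 = sum((t ** 3 - t) / 12 for t in Counter(first_born).values())
  print(tie1)   # three groups of 2 ties gives 3 * (8 - 2) / 12 = 1.5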
|Save Sorted Data|
This function enables you to save the data of a worksheet variable into a new
variable in a sorted form. The name of this new variable is the name of the
source variable prefixed with Sr~. Sorting may be ascending or descending.
The sort may also be tied to the data of another variable, i.e. the data in
variable b may be sorted in the order of sorting of variable a. This paired
sorting can be repeated for any number of columns.
EXAMPLE: Sorting the following aggressivity scores for a sample of firstborn
twins in ascending order gives:
First Born -----> Sr~First Born (Sorted)
86 68
71 70
77 71
68 71
91 72
72 77
77 77
91 86
70 87
71 88
88 91
87 91
EXAMPLE 2: Sorting the following aggressivity scores for a sample of second
born twins by the ascending order of the scores for firstborn twins gives:
First Born Second Born -----> Sr~Second Born~First Born
86 88 64
71 77 65
77 76 80
68 64 77
91 96 72
72 72 76
77 65 65
91 90 88
70 65 72
71 80 81
88 81 96
87 72 90
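The paired sort corresponds to the following Python idiom (the order within
ties on the firstborn scores may differ from the listing above):
  first  = [86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]
  second = [88, 77, 76, 64, 96, 72, 65, 90, 65, 80, 81, 72]

  # Sort the second born scores by ascending order of the firstborn scores.
  paired = [s for _, s in sorted(zip(first, second))]
  print(paired)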
|Save Normal Scores|
This function enables you to save the normal scores of a worksheet variable
into a new variable. The name of this new variable is the name of the source
variable prefixed with Ns~. Normal scores are defined here as Z((2k-1)/2n)
where k is the rank, n is the sample size and Z is a standard normal deviate.
EXAMPLE: Scoring the following aggressivity scores for a sample of firstborn
twins using the normal score formula above gives:
First Born -----> Ns~First Born (normal scores)
86 0.3186
71 -0.6745
77 0
68 -1.7317
91 1.3830
72 -0.3186
77 0
91 1.3830
70 -1.1503
71 -0.6745
88 0.8122
87 0.5485
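A sketch of the formula above in Python (midranks are used for ties, which
matches the equal scores given to the tied 71s and 77s in the listing):
  from statistics import NormalDist

  def normal_scores(data):
      s, n = sorted(data), len(data)
      def midrank(x):
          return (s.index(x) + 1 + n - s[::-1].index(x)) / 2
      # Z((2k - 1) / 2n), with k the (mid)rank and Z the normal quantile.
      return [NormalDist().inv_cdf((2 * midrank(x) - 1) / (2 * n))
              for x in data]

  print(normal_scores([86, 71, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]))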
|Regression and Correlation|
This section provides various regression and correlation analyses. Please note
that Kendall's and Spearman's correlations are provided in the nonparametric
methods section.
¬<Simple linear>╪124799 ¬
¬<Multiple linear>╪129024 ¬
¬<Regression in Groups>╪135745 ¬
¬<Polynomial>╪144338 ¬
¬<Linearized>╪148165 ¬
¬<Probit Analysis>╪149788 ¬
¬<Non-Linear Models>╪156043 ¬
REGRESSION
~~~~~~~~~~
Regression is a way of describing how one variable, the so called dependent
variable, is numerically related to other, so called predictor variables.
The dependent variable is also referred to as Y and is plotted on the vertical
axis (ordinate) of a graph. The predictor variable(s) is(are) also referred
to as X, independent, prognostic or explanatory variables. The horizontal
axis (abscissa) of a graph is used for plotting X. Predictors are variables
which we must be able to measure without error and we must have reason to
assume that the errors associated with measuring Y are randomly distributed.
All of the conclusions that we draw from regression depend upon the truth of
these assumptions about error. The commonest assumption is that the errors in
Y are from a random normal distribution. If this assumption is reasonable
and we suspect that the changes in Y are proportional to the changes in X then
we can try linear regression:
Y (% Growth 70-100 days) │ *
│ * * *
│
│ *
│ *
│ *
│ * *
│ * *
│
│ * *
└───────────────────────────
X (Birth Weight)
Looking at the data like this is a vital first step. From the graph we
suspect that low birth weight babies grow faster in the 70-100 days
interval than their higher birth weight counterparts. You could almost
draw a straight line through the points, therefore, assuming growth between
70 and 100 days is from a normal distribution we can try to fit a straight
line equation using simple linear regression on these data:
Equation: Y = A + BX
B is the gradient, slope or regression coefficient.
A is the intercept of the line at Y axis or regression constant.
The equation describes the best relationship between the POPULATION values of
X and Y which can be found using this method. When you have obtained this
equation it can be used for prediction and various hypothesis tests.
N.B. Always think of the biological relevance of this equation, i.e. in our
example we must not get carried away with the idea that the growth of a baby
between 70 and 100 days after birth is a simple linear function of their birth
weight as there are many other variables affecting the babies' growth. We
could gather more information to feed into a complex multiple regression
but it is very unlikely that we could satisfy all of the above assumptions.
For these reasons data which are not drawn from highly controlled isolated
experiments must be treated with caution.
MATHS: The basic method used to find the regression equation is called least
squares. This minimises the sum of the squares of the errors associated with
each Y point by differentiation. This error is the difference between the
observed Y point and the Y point predicted by the regression equation. In
linear regression this error is also the error term of the Y distribution, the
residual error.
ASSUMPTIONS: X observed without error
Y drawn at random from a normal distribution for each X
True mean of Y distribution for each X lies on regression line
All Y distributions have same variance (this is homoscedasticity)
Y error is independent of X
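The least squares estimates described in the MATHS note above can be written
down directly: B = sum((x - mean x)(y - mean y)) / sum((x - mean x)²) and
A = mean y - B * mean x. A minimal sketch in Python (assuming numpy; the data
here are hypothetical, purely for illustration):

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor
  y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical response

  B = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
  A = y.mean() - B * x.mean()
  residuals = y - (A + B * x)   # least squares minimises (residuals ** 2).sum()
  print(A, B)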
CORRELATION
~~~~~~~~~~~
This refers to the interdependence or co-relationship of variables. In the
context of our example it looks at the closeness of the linear relationship
between X and Y. A measure of this is given by Pearson's product moment
correlation coefficient rho. Rho is called R when it has been estimated
from a regression on sample data. R lies between -1 and 1 with 0 for no
linear correlation, 1 for perfect positive (slope up) linear correlation and
-1 for perfect negative (slope down) linear correlation.
N.B. If R is close to ± 1 then this does NOT mean that there is a good causal
relationship between X and Y. It just shows that the sample data is close
to a straight line. R is a much abused statistic!
MATHS: R squared is the proportion of the total variance of Y that can be
explained by the linear regression of Y on X. 1-R² is the proportion that is
not explained by the regression. Thus 1-R² = S²Y.X / S²Y, where S²Y.X is the
variance of Y about the regression line (the residual variance).
|Simple Linear Regression|
This provides simple linear regression (Y = A + BX) by the least squares method.
It is assumed that for each of the X values the corresponding Y values have
been drawn at random from a normal distribution. Summary statistics are given
in full as a springboard for further analysis. Pearson's product moment
correlation coefficient (r) is given as a measure of association between the
two variables. Confidence limits are constructed for the correlation
coefficient using Fisher's Z transformation. The null hypothesis that r = 0
(i.e. no association) is evaluated using a modified t test (ref 4, 5). The
estimated regression line may be plotted and belts representing the standard
error and confidence interval for the population value of the slope can be
displayed. These belts represent the reliability of the regression estimate,
the tighter the belt the more reliable the estimate (ref 11).
NB If you require a weighted linear regression then please use the multiple
linear regression function in Arcus; it will allow you to use just one
predictor variable, i.e. the simple linear regression situation. Note also
that the multiple regression option will allow you to select regression
without an intercept, i.e. forced through the origin.
EXAMPLE (from Armitage ref 4 p 148): The following data represent birth
weights of babies and their percentage increase between 70 and 100 days after
birth:
X (birth weight oz) Y (increase in weight 70-100 days as % of X)
72 68
112 63
111 66
107 72
119 52
92 75
126 76
80 118
81 120
84 114
115 29
118 42
128 48
128 50
123 69
116 59
125 27
126 60
122 71
126 88
127 63
86 88
142 53
132 50
87 111
123 59
133 76
103 72
106 90
118 68
114 93
94 91
To analyse these data in Arcus you must first enter them into two columns in
the worksheet appropriately labelled. Then select simple linear regression
from the regression and correlation menu of the analysis section. Press enter
when you are prompted for a confidence interval, this will select the default
95% level.
For this example:
Y = -0.8643X + 167.8701
95% CI for slope = -1.2231 to -0.5055
r square = 0.4465
F for regression = 24.2 (p < 0.0001)
r = -0.6682
95% CI for r = -0.8248 to -0.4166
two tailed p (for r = 0) < 0.0001
From this analysis we have gained the equation for a straight line forced
through our data i.e. % increase in weight = 167.87 - 0.864 * birth weight.
The r square value tells us that about 45% of the total variation about the
Y mean is explained by the regression line. The analysis of variance test for
the regression, summarised by the ratio F, shows that the regression itself was
statistically highly significant. This is equivalent to a t test with the null
hypothesis that the slope is equal to zero. The confidence interval for the
slope shows that with 95% confidence the population value for the slope lies
somewhere between -0.5 and -1.2. The correlation coefficient r was
statistically highly significantly different from zero. Its negative value
indicates that there is an inverse relationship between X and Y i.e. lower
birth weight babies show greater % increases in weight at 70 to 100 days after
birth. With 95% confidence the population value for r lies somewhere between
-0.4 and -0.8.
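These figures can be cross-checked outside Arcus; a minimal sketch in Python
(assuming numpy and scipy, for illustration only):

  import numpy as np
  from scipy import stats

  x = np.array([72, 112, 111, 107, 119, 92, 126, 80, 81, 84, 115, 118, 128,
                128, 123, 116, 125, 126, 122, 126, 127, 86, 142, 132, 87,
                123, 133, 103, 106, 118, 114, 94], dtype=float)
  y = np.array([68, 63, 66, 72, 52, 75, 76, 118, 120, 114, 29, 42, 48, 50,
                69, 59, 27, 60, 71, 88, 63, 88, 53, 50, 111, 59, 76, 72,
                90, 68, 93, 91], dtype=float)

  res = stats.linregress(x, y)
  print(res.slope, res.intercept)     # about -0.8643 and 167.8701
  print(res.rvalue, res.rvalue ** 2)  # about -0.6682 and 0.4465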
¬<regression and correlation>╪119789 ¬
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Multiple Linear Regression|
If you need to study the effect of simultaneous changes in several independent
variables (e.g. creatinine clearance and mean systolic blood pressure) upon one
dependent variable (e.g. post-anaesthetic recovery time) then you might find
multiple linear regression useful. Arcus uses singular value decomposition to
solve the linear equations; this is a robust method which optimises accuracy
and is not defeated by collinearity among the predictors. The multiple
regression equation is given
and the significance of each component parameter is indicated. There are also
options for analysis of variance and interpolation. The analysis of variance
provides a test of independence for the Y variable in comparison with the X
variables. A multiple correlation coefficient is given with the analysis of
variance. A logical extension of multiple linear regression is the selection
of predictor (X, independent) variables. There are a number of methods which
deal with this, for example step-up selection, step-down selection, stepwise
regression and best subset selection. The fact that there is not a
predominantly favoured method means that none of them is really satisfactory
for general use; a good discussion is given by Draper and Smith (ref 23). The
current version
of Arcus provides best subset selection by examination of all possible
regressions. You have the option of two selection criteria, minimum Mallows'
Cp statistic or maximum overall F. You may also force the inclusion of
variables in this selection procedure if you consider their exclusion to be
illogical in "real world" terms (ref 23).
EXAMPLE (from Armitage ref 4 p 300): The following data are from a trial of
a hypotensive drug used to lower blood pressure during surgery. The outcome /
dependent variable (Y) is minutes taken to recover an acceptable (100mmHg)
systolic blood pressure and the two predictor or explanatory variables are,
log dose of drug (X1) and mean systolic blood pressure during the induced
hypotensive episode (X2).
X1 X2 Y
2.26 66 7
1.81 52 10
1.78 72 18
1.54 67 4
2.06 69 10
1.74 71 13
2.56 88 21
2.29 68 12
1.80 59 9
2.32 73 65
2.04 68 20
1.88 58 31
1.18 61 23
2.08 68 22
1.70 69 13
1.74 55 9
1.90 67 50
1.79 67 12
2.11 68 11
1.72 59 8
1.74 68 26
1.60 63 16
2.15 65 23
2.26 72 7
1.65 58 11
1.63 69 8
2.40 70 14
2.70 73 39
1.90 56 28
2.78 83 12
2.27 67 60
1.74 84 10
2.62 68 60
1.80 64 22
1.81 60 21
1.58 62 14
2.41 76 4
1.65 60 27
2.24 60 26
1.70 59 28
2.45 84 15
1.72 66 8
2.37 68 46
2.23 65 24
1.92 69 12
1.99 72 25
1.99 63 45
2.35 56 72
1.80 70 25
2.36 69 28
1.59 60 10
2.10 51 25
1.80 61 44
To analyse these data in Arcus you must first enter them into three columns in
the worksheet appropriately labelled. Then select multiple linear regression
from the regression and correlation menu of the analysis section. Press Esc
when you are asked for the standard deviations of Y, i.e. selecting an
unweighted analysis. Press Y when you are asked whether you want an intercept;
one can rarely find a good enough reason not to have an intercept.
For this example:
Y = 23.01 + 23.639 X1 - 0.715 X2
Intercept b0 = 23.01067 (p = 0.214)
X1 b1 = 23.63856 (p = 0.001)
X2 b2 = - 0.71468 (p = 0.022)
r square = 0.2018
r square adjusted = 0.1699
F = 6.32 (p = 0.001)
The variance ratio, F, for the overall regression is highly significant, thus
we have strong evidence that at least one of X1 and X2 is associated with Y.
The r square value shows that only 20% of the variance of Y is accounted for
by the regression, therefore the predictive value of this model is low. The
partial regression coefficients are shown to be significant but the intercept
is not.
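The fitted equation can be cross-checked with any least squares solver that
uses singular value decomposition; a minimal sketch in Python (assuming numpy,
for illustration only):

  import numpy as np

  x1 = np.array([2.26, 1.81, 1.78, 1.54, 2.06, 1.74, 2.56, 2.29, 1.80, 2.32,
                 2.04, 1.88, 1.18, 2.08, 1.70, 1.74, 1.90, 1.79, 2.11, 1.72,
                 1.74, 1.60, 2.15, 2.26, 1.65, 1.63, 2.40, 2.70, 1.90, 2.78,
                 2.27, 1.74, 2.62, 1.80, 1.81, 1.58, 2.41, 1.65, 2.24, 1.70,
                 2.45, 1.72, 2.37, 2.23, 1.92, 1.99, 1.99, 2.35, 1.80, 2.36,
                 1.59, 2.10, 1.80])
  x2 = np.array([66, 52, 72, 67, 69, 71, 88, 68, 59, 73, 68, 58, 61, 68, 69,
                 55, 67, 67, 68, 59, 68, 63, 65, 72, 58, 69, 70, 73, 56, 83,
                 67, 84, 68, 64, 60, 62, 76, 60, 60, 59, 84, 66, 68, 65, 69,
                 72, 63, 56, 70, 69, 60, 51, 61], dtype=float)
  y = np.array([7, 10, 18, 4, 10, 13, 21, 12, 9, 65, 20, 31, 23, 22, 13, 9,
                50, 12, 11, 8, 26, 16, 23, 7, 11, 8, 14, 39, 28, 12, 60, 10,
                60, 22, 21, 14, 4, 27, 26, 28, 15, 8, 46, 24, 12, 25, 45, 72,
                25, 28, 10, 25, 44], dtype=float)

  X = np.column_stack([np.ones_like(x1), x1, x2])
  coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # solved via SVD
  print(coef)                                   # about [23.01, 23.64, -0.71]
  fitted = X @ coef
  r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
  print(r2)                                     # about 0.2018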
Arcus offers more facilities for general linear regression than I have shown
here. Their use requires a reasonable background knowledge of general linear
models and their assumptions, so I shall not discuss them all with examples;
the experienced user will be familiar with them. A good reference is Draper &
Smith ref 23.
In summary, these facilities are:
1. Best subset selection. When you have many predictor variables you can ask
Arcus to select the subset of predictor variables which gives the "best"
fitting model as judged by Mallows' Cp statistic or the overall significance
of the regression. Mallows' Cp is favoured in most situations.
2. XXi matrix. This prints out the XXi, i.e. (X'X) inverse, matrix of the
linear model, from which the hat / projection matrix is derived. Double
precision is displayed as the singular value decomposition of this general
linear regression is performed in double precision.
3. Influential data. This gives an analysis of residuals and allows you to
save the residuals and their associated statistics. It is good practice to
examine a plot of the residuals against Y. You might also wish to have a
normal plot of the residuals, this is available in the pictorial statistics
menu of the Arcus analysis section. Along with the residuals you are given
the standard error of the predicted Y, the leverage Hi (the ith diagonal
element of the Hat matrix), Studentized residuals, Cook's distance,
covariance ratio and DFFITS. Note that Studentized residuals have a t
distribution with n-p-1 degrees of freedom. If Hi is larger than 2p/n then
that observation has unusual predictor values. Unusual predicted as
opposed to predictor values are indicated by large residuals. Cook's
distance and DFFITS combine these factors in an overall measure. Cook's D
can be considered large if it exceeds F (0.50, p, n-p) from the F
distribution. DFFITS is unusually large if it is greater than 2 * SQR(p/n).
Unusual covariance ratios are considered to lie outside the range
1 - 3 * (p/n) to 1 + 3 * (p/n). A good discussion of the analysis of
residuals is given by Belsley et al. ref 32. In this paragraph p = number
of coefficients in the model (including constant) and n = number of
observations.
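All of the diagnostic quantities above can be derived from the design matrix;
a minimal sketch in Python (assuming numpy; the design matrix here is
simulated, purely for illustration, and the internally studentized residuals
shown may differ slightly from an externally studentized version):

  import numpy as np

  rng = np.random.default_rng(0)
  n, p = 30, 3                                  # p counts the constant too
  X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
  y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

  H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat / projection matrix
  h = np.diag(H)                                # leverage Hi; flag Hi > 2p/n
  resid = y - H @ y
  s2 = resid @ resid / (n - p)
  stud = resid / np.sqrt(s2 * (1 - h))          # studentized residuals
  cooks = stud ** 2 * h / (p * (1 - h))         # Cook's distance
  dffits = stud * np.sqrt(h / (1 - h))          # flag |DFFITS| > 2*sqrt(p/n)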
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Regression in Groups|
¬<Linearity with replicates of Y>╪136133 ¬
¬<Grouped linear regression with covariance analysis>╪139152 ¬
This sub-section provides grouped linear regression and analysis of covariance.
There is also a test for linearity when repeated observations of the Y
(dependent) variable are available for each observation in the X (independent)
variable.
|Linearity with replicates of Y|
The standard analysis of variance for a linear regression tells you about the
significance of the slope but it does not test whether or not you should be
using linear regression in the first place. Here we provide a method which
can be used to test the assumption of linearity.
In important studies which utilise linear regression it is worth collecting
repeat Y observations. This enables you to run a test of linearity and thus
justify or refute the use of linear regression in subsequent analysis of these
data (ref 4). The replicate Y observations should be entered in separate
worksheet columns (variables), one column for each observation (row) in the X
variable. The number of Y replicate variables which you are prompted to
choose is governed by the size of the X variable which you have selected.
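One common form of this partition splits the between-doses sum of squares
into a regression part (1 degree of freedom) and a part for deviations from
linearity (k-2 degrees of freedom), each tested against the within-doses mean
square. A minimal sketch in Python (assuming numpy and scipy; the data are
hypothetical and the exact error term Arcus pools may differ slightly):

  import numpy as np
  from scipy.stats import f as f_dist

  x = np.array([1.0, 2.0, 3.0, 4.0])            # hypothetical dose levels
  reps = [np.array([2.1, 2.5, 1.9]),            # hypothetical replicates of Y
          np.array([3.8, 4.2, 4.0, 4.4]),
          np.array([6.2, 5.8, 6.0]),
          np.array([7.9, 8.3, 8.1, 7.7])]

  n = np.array([len(r) for r in reps])
  m = np.array([r.mean() for r in reps])
  N, k = n.sum(), len(x)
  grand = sum(r.sum() for r in reps) / N

  xbar = (n * x).sum() / N
  sxx = (n * (x - xbar) ** 2).sum()
  sxm = (n * (x - xbar) * (m - grand)).sum()
  ss_reg = sxm ** 2 / sxx                         # due to regression, 1 df
  ss_dev = (n * (m - grand) ** 2).sum() - ss_reg  # deviations from linearity
  ss_within = sum(((r - mu) ** 2).sum() for r, mu in zip(reps, m))

  ms_within = ss_within / (N - k)
  F_dev = (ss_dev / (k - 2)) / ms_within          # large F => doubt linearity
  print(F_dev, f_dist.sf(F_dev, k - 2, N - k))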
EXAMPLE (from Armitage, ref. 4 p268): A preparation of vitamin D is
tested by feeding it to rats with induced osteomalacia and measuring the
subsequent re-mineralisation of their bones using radiographic methods:
Log dose of Vit D ---> 0.544 0.845 1.146
Bone density score --> 0 1.5 2
0 2.5 2.5
1 5 5
2.75 6 4
2.75 4.25 5
1.75 2.75 4
2.75 1.5 2.5
2.25 3 3.5
2.25 3
2.5 2
3
4
4
To analyse these data in Arcus you must first enter them into four columns in
the worksheet appropriately labelled. The first column is just three rows long
and contains the three log doses of vitamin D above. The next three columns
represent the repeated measures of bone density for each of the three levels
of log dose of vitamin D which are represented by the rows of the first column.
Then select the linearity function from the regression in groups sub-menu of the
regression and correlation menu in the analysis section. When you are prompted
for the X variable select the column which contains the three log dose levels.
Then select the three Y columns which correspond to each row (level) of the
X variable i.e. 0.544 --> 0.845 --> 1.146.
For this example:
Due to regression F = 9.45 (p = 0.0047)
Deviations from X means F = 1.95 (p = 0.1738)
Thus the regression itself (meaning the slope) was statistically highly
significant. If the deviations from X means had been significant then we
should have rejected our assumption of linearity; as it stands they were not.
Arcus gives you plain English interpretations of these results directly.
¬<p values>╪29175 ¬
¬<non-linear models>╪156043 ¬
¬<reference list>╪310584 ¬
|Grouped linear regression with covariance analysis|
The grouped regression function enables you to compare regression lines. Again
it is assumed that for each of the X values the corresponding Y values have been
drawn at random from a normal distribution. The method involves examination of
the regression parameters for a group of XY pairs in relation to a common fitted
function. This provides an analysis of variance which shows whether there is
a significant difference between the slopes of the individual regression lines
as a whole. Arcus then compares all of the slopes individually. The vertical
distance between each regression line is then examined using analysis of
covariance and the corrected means are given (ref 4). This is just one facet of
the analysis of covariance and there exist alternative methods. For further
information please consult good references such as Draper & Smith (ref 23) and
Armitage & Berry (ref 4).
EXAMPLE (from Armitage ref. 4 p 277): Three preparations of vitamin D are
tested by feeding them to rats with induced osteomalacia and measuring the
subsequent re-mineralisation of their bones using radiographic methods:
For the standard preparation:
Log dose of Vit D ---> 0.544 0.845 1.146
Bone density score --> 0 1.5 2
0 2.5 2.5
1 5 5
2.75 6 4
2.75 4.25 5
1.75 2.75 4
2.75 1.5 2.5
2.25 3 3.5
2.25 3
2.5 2
3
4
4
For alternative preparation I:
Log dose of Vit D ---> 0.398 0.699 1.000 1.301 1.602
Bone density score --> 0 1 1.5 3 3.5
1 1.5 1 3 3.5
0 1.5 2 5.5 4.5
0 1 3.5 2.5 3.5
0 1 2 1 3.5
0.50 0.5 0 2 3
For alternative preparation F:
Log dose of Vit D ---> 0.398 0.699 1.000
Bone density score --> 2.75 2.5 3.75
2 2.75 5.25
1.25 2.25 6
2 2.25 5.5
0 3.75 2.25
0.5 3.5
To analyse these data in Arcus you must first enter them into 14 columns in
the worksheet appropriately labelled. The first column is just three rows long
and contains the three log doses of vitamin D for the standard preparation.
The next three columns represent the repeated measures of bone density for each
of the three levels of log dose of vitamin D which are represented by the rows
of the first column. This is then repeated for the other two preparations.
Then select the grouped linear regression function from the regression in groups
sub-menu of the regression and correlation menu in the analysis section. Enter
3 as the number of XY pairs and select Y when asked if you wish to use
replicates. When you are prompted for the first X variable select the column
which contains the three log dose levels for the standard preparation. Then
select the three Y columns which correspond to each row (level) of the X
variable for the standard preparation i.e. 0.544 --> 0.845 --> 1.146.
Alternatively these data could have been entered in just three pairs of
worksheet columns representing the three preparations with a log dose column
and column of the mean bone density score for each dose level. By accepting
the more long-winded input of replicates Arcus is encouraging you to run a
test of linearity on your data.
For this example:
common slope p = < 0.0001
between slopes p = 0.1510
slope comparisons: standard vs I p = 0.4195
standard vs F p = 0.0379
I vs F p = 0.0325
corrected covariance analysis:
F = 1.69 (p = 0.2510)
vertical separations: standard vs I p = 0.3070
standard vs F p = 0.4345
I vs F p = 0.2493
The common slope is highly significant and the overall test for difference
between the slopes was non-significant. Provided that our assumption of
linearity holds true we can conclude that these lines are reasonably parallel.
Looking more closely at the individual slopes, preparation F is shown to be
significantly different from the other two, but this difference was not large
enough to make the overall slope comparison significantly heterogeneous.
The analysis of covariance did not show any significant vertical separation of
the three regression lines.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Polynomial Regression|
If you have reason to believe that a polynomial model is appropriate to your
data then you can use this function to construct one. You supply the number
of degrees (order) of the polynomial and Arcus gives you the coefficient for
each degree of the equation together with the constant. Subjective goodness of
fit may be assessed by plotting the data and the fitted curve. Try to use as
few degrees as possible for a model which achieves significance at each degree.
Regression is by singular value decomposition (ref 23, 14). An analysis of
variance is given via the analysis option. There is also an option which
calculates the area under the curve. The polynomial function which has been
fitted is integrated from the lowest to the highest X value using Romberg's
method to give an area under the fitted curve. The trapezoidal rule is also
used directly on the vector to give another estimate of the area under the
curve. The plot function supplies visual confidence and prediction intervals
but you can save the predicted Y values with their errors and intervals by
selecting option [6].
If you require more detail from the regression, such as an analysis of the
residuals, then you should use the multiple linear regression option. To
achieve a polynomial fit using multiple linear regression you must first
create new worksheet columns which contain the X variable raised to powers
up to the degree you want. For example, a second order fit requires Y,
X and X * X.
EXAMPLE (from Statistics ref 34 p 753): Here we will use a non-biomedical
example to emphasise the point that polynomial regression is more often
applicable to data from the physical sciences where variables are more
controllable. Below are the electricity consumption data in kilowatt hours
per month from ten houses and the areas in square feet of these houses:
House area KW-hours per month
1290 1182
1350 1172
1470 1264
1600 1493
1710 1571
1840 1711
1980 1804
2230 1840
2400 1956
2930 1954
To analyse these data in Arcus you must first prepare them in two worksheet
columns appropriately labelled. Then select polynomial regression from the
regression and correlation menu of the analysis section. The X (independent)
variable is house area and the Y (dependent) variable is KW-hours per month.
Enter the order of this polynomial as 2.
For this example:
KW-hours = -1216.14389 + 2.39893 * area - 0.00045 * area * area
F = 189.71 (p < 0.0001)
Root MSE = 46.801
R sqr = 0.9819
for intercept p = 0.0016
X p < 0.0001
X*X p = 0.0001
Thus the overall regression and both degree coefficients are highly significant.
NB Look at a plot of this data curve. The right hand end point shows a very
sharp decline. If you were to extrapolate beyond the data you have observed
then you might conclude that very large houses have a very low electricity
consumption. This is obviously ludicrous. Polynomials are often well out
of line with common sense in parts of the curve but seem to fit other parts
well. You must blend common sense, art and mathematics when fitting these
models! Remember: a) your model will be much more reliable if it is
built around large numbers of observations; b) do not extrapolate beyond
your observations; c) choose numbers for X which are not too large as they
will cause overflow with higher degree polynomials; d) do not draw false
confidence from low p values, only use these to support your model if the
plot looks good!
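The fit and both area estimates can be cross-checked; a minimal sketch in
Python (assuming numpy, for illustration only):

  import numpy as np

  area = np.array([1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400,
                   2930], dtype=float)
  kwh = np.array([1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956,
                  1954], dtype=float)

  coef = np.polyfit(area, kwh, 2)        # coefficients, highest power first
  print(coef)                            # about [-0.00045, 2.399, -1216.1]
  fit = np.poly1d(coef)

  xs = np.linspace(area.min(), area.max(), 1000)
  print(np.trapz(fit(xs), xs))           # area under the fitted curve
  print(np.trapz(kwh, area))             # trapezoidal rule on the raw data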
¬<p values>╪29175 ¬
¬<non-linear models>╪156043 ¬
¬<reference list>╪310584 ¬
|Linearized Estimates|
This section provides regression estimates for three linearised functions by
an unweighted least squares method. This approach is far from ideal and should
be used only to indicate that a more robust fit of the selected model might be
appropriate for your data. Exponential, geometric and hyperbolic approximations
are offered.
For the exponential model the data are linearized by log transformation of
the dependent variable and the linear regression gives you A and B for the
function Y = A * exp(B * X).
For the geometric method the natural logarithms of both variables are
linearly regressed for Y = A * (X ^B).
The hyperbolic method uses the reciprocals of both variables to calculate A and
B for Y = X / (A + B * X).
The standard error of the estimate is given for each of these regressions
but please note that the errors of your dependent / response variable might
not be from a normal distribution.
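The three linearizations are simple to reproduce; a minimal sketch in Python
(assuming numpy and scipy; the data are hypothetical, purely for
illustration):

  import numpy as np
  from scipy import stats

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # hypothetical, positive
  y = np.array([2.6, 3.9, 6.1, 9.2, 13.8, 21.1])  # hypothetical, positive

  # Exponential Y = A * exp(B * X): regress ln(Y) on X.
  r = stats.linregress(x, np.log(y))
  A_exp, B_exp = np.exp(r.intercept), r.slope

  # Geometric Y = A * X^B: regress ln(Y) on ln(X).
  r = stats.linregress(np.log(x), np.log(y))
  A_geo, B_geo = np.exp(r.intercept), r.slope

  # Hyperbolic Y = X / (A + B * X): 1/Y = A * (1/X) + B.
  r = stats.linregress(1.0 / x, 1.0 / y)
  A_hyp, B_hyp = r.slope, r.intercept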
This section of Arcus is only intended for those who are familiar with
regression modelling and who use these linearized estimates as a springboard
for further modelling. For these reasons we will not work through an example
here. For generalized linear modelling I recommend the products of The Numerical
Algorithms Group and Rothamsted Experimental Station; these are GLIM and
Genstat. For non-linear modelling I recommend MLP and Genstat. For information
on all of these products contact NAG on UK (0)865 511 245.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<non-linear models>╪156043 ¬
¬<reference list>╪310584 ¬
|Probit Analysis|
When biological responses are plotted against their causal stimuli (or
logarithms of them) they often describe a sigmoid curve. Methods have been
developed which linearize this relationship so that they are easier to deal
with numerically. This linearization can be achieved using a number of
transformations including logit, probit and angular. For most systems the
probit (normal sigmoid) and logit (logistic sigmoid) give the most closely
fitting result. Logistic methods are also useful in Epidemiology because
odds ratios can be determined easily from differences between fitted logits.
In biological assay, however, probit analysis is preferable (ref 18, 19).
Curves produced by these methods are very similar, with maximum variation
occurring within 10% of the upper and lower asymptotes. Historically some
workers have used logistic regression because it is easier to calculate than
probit analysis; this is no longer true with the aid of computers.
Probit analysis has been added to Arcus to provide dose/stimulus - response
curve fitting. Your data are entered as dose levels, number of subjects tested
at each dose level and number responding at each dose level. You are also
given the opportunity to enter a control result for the number of subjects
responding in the absence of dose/stimulus - this provides a global adjustment
for natural mortality/responsiveness. You are also asked whether you want log
transformation of the dose levels or not. The curve is then fitted by Newton
-Raphson iteration. The quality of the resultant curve is assessed by
statistics for heterogeneity which follow a chi-square distribution. If these
are significant then your observed values deviate from the fitted curve too
much for reliable inference to be made from that curve (ref 18, 19). Arcus
gives you the effective/lethal levels of dose/stimulus with confidence intervals
at the quantiles you specify. The fitted curve can be plotted and printed.
If you require more complex probit analysis, such as the calculation of
relative potencies from several related dose response curves, then you should
consider using non-linear optimization software or specialist dose-response
analysis software such as Bliss. The latter is a FORTRAN routine written by
David Finney and Ian Craigie; it is available from Edinburgh University
Computing Centre. If you are considering using Bliss then you must be familiar
with FORTRAN and the basic principles of probit analysis (ref 18, 19). For more
general non-linear model fitting with the ability to constrain curves to
"parallelism" then I advise you to use MLP or Genstat. At this point most
people should seek statistical help. More information is available under the
notes on ¬non-linear models╪156043 ¬.
CAUTION: Please do not think of probit analysis as a "cure all" for dose
response curves. Many log dose - response relationships are clearly not
Gaussian sigmoids. They may not be any of the other sigmoids either, e.g.
angular, Wilson-Worcester or Cauchy-Urban. You may not be able to use a
regression model "off the shelf". This brings us to the complex subject of
non-linear modelling. At this point most people should seek statistical help.
Please refer to the notes on ¬non-linear models╪156043 ¬.
CAUTION 2: Please remember that this form of probit analysis is designed to
handle only quantal responses with binomial error distributions. Quantal data,
such as the number of subjects responding vs total number of subjects tested,
usually have binomial error distributions. You must NOT use continuous data,
such as % maximal response, with probit analysis as these data require
regression methods which assume a different error distribution. Again, at this
point most people should seek statistical help. Please refer to the notes on
¬non-linear models╪156043 ¬.
EXAMPLE (from Finney ref 18 p 98): The following data represent a study of the
age at menarche of 3918 Warsaw girls. For each age group you are given mean
age, total number of girls and the number of girls who had reached menarche.
Age Girls + Menses
9.21 376 0
10.21 200 0
10.58 93 0
10.83 120 2
11.08 90 2
11.33 88 5
11.58 105 10
11.83 111 17
12.08 100 16
12.33 93 29
12.58 100 39
12.83 108 51
13.08 99 47
13.33 106 67
13.58 105 81
13.83 117 88
14.08 98 79
14.33 97 90
14.58 120 113
14.83 102 95
15.08 122 117
15.33 111 107
15.58 94 92
15.83 114 112
17.58 1049 1049
To analyse these data in Arcus you must first prepare them in three worksheet
columns appropriately labelled. Then select probit analysis from the regression
and correlation menu of the analysis section. "Dose" levels here are the mean
ages, number in each group are the number of girls and number responding are
the number + menses. Select probit as the sigmoid model. Then select a 95%
confidence interval by pressing the enter key when you see the confidence
interval menu. Select N when asked whether or not you require logarithmic
conversion of the independent variable (mean ages).
For this example:
Y = -6.8189 + 0.9078 X in probits
heterogeneity of deviations from model p = 0.5262
ED50:
The estimated median age at menarche = 13.02 (95% CI = 12.94 to 13.09)
Having looked at a plot of this model and accepted it as appropriate we can
conclude with 95% confidence that the true population value for median age at
menarche in Warsaw lay between 12.94 and 13.09 years when this study was done.
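The fitted line is easy to approximate by direct maximum likelihood; a
minimal sketch in Python (assuming numpy and scipy). This is not the
Newton-Raphson routine used by Arcus, and it works on the Z scale; classical
probits add 5, so expect the intercept to be about 5 less than in the Arcus
equation above:

  import numpy as np
  from scipy.stats import norm
  from scipy.optimize import minimize

  age = np.array([9.21, 10.21, 10.58, 10.83, 11.08, 11.33, 11.58, 11.83,
                  12.08, 12.33, 12.58, 12.83, 13.08, 13.33, 13.58, 13.83,
                  14.08, 14.33, 14.58, 14.83, 15.08, 15.33, 15.58, 15.83,
                  17.58])
  total = np.array([376, 200, 93, 120, 90, 88, 105, 111, 100, 93, 100, 108,
                    99, 106, 105, 117, 98, 97, 120, 102, 122, 111, 94, 114,
                    1049], dtype=float)
  resp = np.array([0, 0, 0, 2, 2, 5, 10, 17, 16, 29, 39, 51, 47, 67, 81, 88,
                   79, 90, 113, 95, 117, 107, 92, 112, 1049], dtype=float)

  def nll(params):                      # binomial negative log likelihood
      a, b = params
      p = np.clip(norm.cdf(a + b * age), 1e-10, 1 - 1e-10)
      return -(resp * np.log(p) + (total - resp) * np.log(1 - p)).sum()

  fit = minimize(nll, x0=[-11.0, 0.9], method="Nelder-Mead")
  a, b = fit.x
  print(a + 5, b)   # in classical probits: about -6.82 and 0.908
  print(-a / b)     # ED50, the median age at menarche: about 13.02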
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<non-linear models>╪156043 ¬
¬<reference list>╪310584 ¬
|Non-Linear Models|
Biomedical research reveals many relationships which are inherently non-linear.
One way of dealing with this is to transform variables so that the relationship
between them approximates linearity. This works well in many cases but is not
possible in others.
One of the greatest problems you face when fitting transformed variables is that
errors you assumed to be normal in the non-transformed variable become
non-normal after transformation. In specific cases such as the probit analysis
in Arcus, this has been anticipated and the error calculations have been
designed to cope with the expected error distribution. It is not advisable to
feed transformed variables through linear regression. If you are confident of
a particular model then you are justified in using a generalised linear model
method to fit your data. Examples of this are probit analysis and logistic
regression. Please note that the current version of Arcus Pro-Stat does not
offer multiple logistic regression. A multiple logistic regression module is
under development for the next version of Arcus Pro-Stat. SAS, Genstat and
GLIM have logistic regression functions.
If you need to develop a non-linear model for your data then you MUST know what
you are doing. This is a highly complex area which blends gut feeling, art and
science. Please seek statistical advice if you want to build non-linear models.
It is not the place of Arcus to cover this large and highly specialised field;
you should seek out a well validated non-linear estimation package that is
supported by experts in the field. The only such packages I have found are
MLP and Genstat. The former is a dedicated non-linear estimation package of
academic excellence from Gavin Ross at Rothamsted Experimental Station. He is
widely published in this field and in my opinion both MLP and his book (ref 34)
represent the state of the art in practical non-linear modelling. Genstat
is a general stats package which includes many of the functions of MLP because
it also comes from Rothamsted. Genstat is not as easy to use as Arcus but it
covers a number of specialist areas which Arcus does not. I would recommend
Genstat as a good partner to Arcus.
Nota Bene!! PLEASE BEWARE OF PACKAGES WHICH CLAIM TO BE "BLACK BOXES" FOR
NON-LINEAR MODELLING, THIS IS NOT POSSIBLE AT PRESENT (1994).
For more information on GenStat, MLP or GLIM please contact the Numerical
Algorithms Group on UK (0)865 511 245.
|Analysis of Variance|
¬<One way>╪162596 ¬
¬<Two way>╪164960 ¬
¬<Two way with replicates>╪167988 ¬
¬<Crossover>╪171202 ¬
¬<Kruskal-Wallis>╪174413 ¬
¬<Friedman>╪177293 ¬
Analysis of variance (ANOVA) represents a group of methods for investigating
how the means of variables are affected by the way in which those variables are
classified. In practical terms, you can test for an overall difference between
the population means for a group of samples within the constraints of a given
experimental design. Arcus then allows you to make individual comparisons
between each of the groups using methods which have been designed for the
multiple comparison or simultaneous inference situation. When multiple
comparisons are made you are in danger of type I error when using t tests alone,
thus, more conservative approaches are required. Arcus offers you the methods
due to Scheffé, Newman-Keuls and gives Bonferroni's limitation with the t tests
(ref 4, 13, 22). With the Newman-Keuls method, means are first ordered in
sequence then each possible discrete comparison is made. The probability
associated with the resultant q values are then derived from the Studentized
range. For Scheffé's test all possible linear contrasts are also made
automatically. Please note that this is a controversial area in statistics and
you would be wise to seek the advice of a statistician before you design your
study. In general you should design experiments so that you can avoid having
to "dredge" groups of data for differences, decide which contrasts you are
interested in at the outset. An excellent account of ANOVA is given by
Armitage & Berry (ref 4). The nonparametric alternatives to ANOVA are also
covered in this section.
BEYOND ARCUS:
If each treatment/exposure factor in your design contains sub-factors of
treatment/exposure groups then you should consider a nested hierarchical
analysis of variance. This design is not covered by the present version of
Arcus, SAS gives a reasonably good implementation of it.
Hospital 1 Hospital 2
* *
ward 1 ward 2 ward 3 ward 1 ward 2 ward 3
x x x x x x <--- patients
x x x x x x
x x x x x x
x x x x x
x x x
x x
If your design represents repeated exposures/treatments for two different
categorisations then you should consider a Latin square design. An example of
this is the response of 5 different rats (factor 1) to 5 different treatments
(repeated blocks) when housed in 5 different types of cage (factor 2).
Rat       1   2   3   4   5
Cage
1         A   B   C   D   E
2         B   C   D   E   A
3         C   D   E   A   B
4         D   E   A   B   C
5         E   A   B   C   D
For designs with complete missing blocks you should consider a balanced
incomplete block design provided the number of missing blocks does not exceed
the number of treatments.
Block 1 2 3 4
Treatment A x x x
B x x x
C x x x
D x x x
If all factor levels in a design are of intrinsic interest rather than some
form of randomised blocking then you should consider a factorial design.
Factorial ANOVA can combine levels into treatments; a simple application of
this is the crossover ANOVA which is offered by Arcus. More complex factorial
designs require careful planning and I would advise you to seek statistical
advice at this stage.
These ANOVA designs are not covered by the current version of Arcus. SAS
offers a range of complex ANOVAs and BMDP covers most.
|One Way|
Imagine you have four groups of data which represent one experiment performed
on four different occasions with ten different subjects each time. You can
test the consistency of the experimental conditions or the inherent error of
the experiment using a one way analysis of variance. This assumes that each
group comes from an approximately normal distribution and that the variability
within the groups is roughly constant. The factors are arranged so that
experiments are columns and subjects are rows, this is how you must enter your
data in the Arcus worksheet. The F test is fairly robust to small deviations
from these assumptions but you could use the ¬Kruskal-Wallis╪174413 ¬ test if there was
any doubt. A significant test indicates a difference between the population
means for the groups as a whole. You may then go on to make ¬multiple contrasts╪180310 ¬
between the groups but this "dredging" should be avoided if possible. If the
groups in this example had been a series of treatments / exposures to which
subjects (blocks) were randomly allocated then a two way randomised block design
ANOVA should have been used.
EXAMPLE (from Armitage ref 4 p 193):
The following data represent the numbers of worms isolated from the GI tracts
of four groups of rats in a trial of carbon tetrachloride as an anthelminthic.
These four groups were the control (untreated) groups:
Expt 1 Expt 2 Expt 3 Expt 4
279 378 172 381
338 275 335 346
334 412 335 340
198 265 282 471
303 286 250 318
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select the one way function from the
analysis of variance menu of the analysis section. Enter the number of groups
as four.
For this example:
F = 2.27 (p = 0.1195)
The null hypothesis that there is no difference in mean worm counts across the
four groups is therefore retained. If we had rejected this null hypothesis then we would
have had to take a close look at the experimental conditions to make sure that
all control groups were exposed to the same conditions.
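A minimal cross-check in Python (assuming scipy, for illustration only):

  from scipy import stats

  e1 = [279, 338, 334, 198, 303]
  e2 = [378, 275, 412, 265, 286]
  e3 = [172, 335, 335, 282, 250]
  e4 = [381, 346, 340, 471, 318]

  F, p = stats.f_oneway(e1, e2, e3, e4)
  print(F, p)   # about 2.27 and 0.12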
¬<p values>╪29175 ¬
¬<multiple contrasts>╪180310 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Two Way|
If your data are classified simultaneously by two factors such that each level
of one factor can be combined with all levels of the other factor then a two way
ANOVA might be appropriate. If one of these factors represents treatments/
exposures and the other represents experimental subjects which have been
randomly allocated to each of these treatments then you are justified in using
a randomised block design. The factors are arranged so that treatments are
columns and subjects are rows, this is how you must enter your data in the Arcus
worksheet. The warnings above concerning multiple comparison methods apply here
also. There is no special provision for substitution of missing data in the
simple two way ANOVA, a row containing a missing value is simply left out of
the analysis.
If you wish to use a two way ANOVA but your data are clearly non-normal then
you should consider the nonparametric alternative due to Milton ¬Friedman╪177293 ¬.
EXAMPLE (from Armitage ref 4 p 218):
The following data represent clotting times (mins) of plasma from eight subjects
treated in four different ways. The eight subjects (blocks) were allocated at
random to each of the four treatment groups:
Treatment 1 Treatment 2 Treatment 3 Treatment 4
8.4 9.4 9.8 12.2
12.8 15.2 12.9 14.4
9.6 9.1 11.2 9.8
9.8 8.8 9.9 12
8.4 8.2 8.5 8.5
8.6 9.9 9.8 10.9
8.9 9 9.2 10.4
7.9 8.1 8.2 10
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select two way from the analysis of
variance menu of the analysis section. Enter the number of groups as four.
For this example:
F (VR between subjects) = 17.2042 P < 0.0001 ***
F (VR between groups) = 6.61503 P = 0.0025 **
Newman-Keuls Multiple Comparisons
Treatment 4 vs Treatment 3 Q = 3.798024 P = 0.0140 *
Treatment 4 vs Treatment 2 Q = 4.583823 P = 0.0106 *
Treatment 4 vs Treatment 1 Q = 6.024452 P = 0.0020 **
Treatment 3 vs Treatment 2 Q = .7857996 P = 0.4155
Treatment 3 vs Treatment 1 Q = 2.226428 P = 0.2785
Treatment 2 vs Treatment 1 Q = 1.440628 P = 0.3201
Here we can see that there was a statistically highly significant difference
between mean clotting times across the groups. The difference between
subjects is of no particular interest here. The ¬multiple contrasts╪180310 ¬ show us
that the mean clotting time for group four is statistically significantly
different from the other three which are not significantly separated from
each other.
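The two F ratios can be cross-checked from the classical sums of squares; a
minimal sketch in Python (assuming numpy and scipy, for illustration only):

  import numpy as np
  from scipy.stats import f as f_dist

  # Rows are subjects (blocks), columns are treatments 1 to 4.
  y = np.array([[8.4, 9.4, 9.8, 12.2],
                [12.8, 15.2, 12.9, 14.4],
                [9.6, 9.1, 11.2, 9.8],
                [9.8, 8.8, 9.9, 12.0],
                [8.4, 8.2, 8.5, 8.5],
                [8.6, 9.9, 9.8, 10.9],
                [8.9, 9.0, 9.2, 10.4],
                [7.9, 8.1, 8.2, 10.0]])
  b, k = y.shape
  grand = y.mean()
  ss_subj = k * ((y.mean(axis=1) - grand) ** 2).sum()
  ss_grp = b * ((y.mean(axis=0) - grand) ** 2).sum()
  ss_res = ((y - grand) ** 2).sum() - ss_subj - ss_grp
  df_res = (b - 1) * (k - 1)
  for name, ss, df in (("subjects", ss_subj, b - 1), ("groups", ss_grp, k - 1)):
      F = (ss / df) / (ss_res / df_res)
      print(name, F, f_dist.sf(F, df, df_res))  # about 17.20 and 6.62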
¬<p values>╪29175 ¬
¬<multiple contrasts>╪180310 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Two Way with Replicates|
The simple two way randomised block design assumes that the row (subject) and
column (group) effects are additive. This means that apart from experimental
error, the difference in effect between any two rows is the same for all columns
and vice versa. If these effects are not additive then there exists a row
-column interaction which must be investigated by repeating the observations
for each block. These data can then be analysed using this two way randomised
block design ANOVA for repeated observations. Arcus will compensate for missing
observations in the replicates by estimating them as the mean of the replicates
present and by reducing the degrees of freedom; you should avoid this situation
if possible. Enter each set of replicates in a separate worksheet column so
that there is a different Arcus variable for each cell of the two way table;
the replicates form a third dimension, coming out of the page, which is as
deep as the number of rows for these data in the worksheet.
EXAMPLE (from Armitage ref 4 p 221):
The following data represent clotting times (mins) from three subjects treated
in three different ways. The plasma samples were allocated randomly to the
treatments and the analysis was repeated three times for each sample.
Treatment A B C
Subject 1 9.8 9.9 11.3
10.1 9.5 10.7
9.8 10 10.7
Subject 2 9.2 9.1 10.3
8.6 9.1 10.7
9.2 9.4 10.2
Subject 3 8.4 8.6 9.8
7.9 8 10.1
8 8 10.1
To analyse these data in Arcus you must first prepare them in nine worksheet
columns:
s = subject
t = treatment
s1t1 s1t2 s1t3 s2t1 s2t2 s2t3 s3t1 s3t2 s3t3
9.8 9.9 11.3 9.2 9.1 10.3 8.4 8.6 9.8
10.1 9.5 10.7 8.6 9.1 10.7 7.9 8 10.1
9.8 10 10.7 9.2 9.4 10.2 8 8 10.1
Next select the two way with replicates option from the analysis of variance
menu of the analysis section. Enter the number of groups as three and the
number of subjects as three.
For this example:
F (VR Subjects) = 63.13918 P < 0.0001 ***
F (VR Groups) = 80.32172 P < 0.0001 ***
F (VR Interaction) = 2.522677 P = 0.1082
Newman-Keuls Multiple Comparisons
Group 3 vs Group 2 Q = 26.22421 P = 0.0002 ***
Group 3 vs Group 1 Q = 27.50345 P = 0.0001 ***
Group 2 vs Group 1 Q = 1.279235 P = 0.3778
Here we see a statistically highly significant difference between mean clotting
times across the groups and more specifically, group 3 stands out from the rest.
If the F value for interaction had been significant then there would have been
little point in drawing conclusions about independent group and subject effects
from the other F values.
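The same partition with an interaction term; a minimal sketch in Python
(assuming numpy and scipy, for illustration only):

  import numpy as np
  from scipy.stats import f as f_dist

  # y[subject, treatment, replicate]
  y = np.array([[[9.8, 10.1, 9.8], [9.9, 9.5, 10.0], [11.3, 10.7, 10.7]],
                [[9.2, 8.6, 9.2], [9.1, 9.1, 9.4], [10.3, 10.7, 10.2]],
                [[8.4, 7.9, 8.0], [8.6, 8.0, 8.0], [9.8, 10.1, 10.1]]])
  s, t, r = y.shape
  grand = y.mean()
  ss_subj = t * r * ((y.mean(axis=(1, 2)) - grand) ** 2).sum()
  ss_grp = s * r * ((y.mean(axis=(0, 2)) - grand) ** 2).sum()
  cell = y.mean(axis=2)
  ss_inter = r * ((cell - grand) ** 2).sum() - ss_subj - ss_grp
  ss_err = ((y - cell[:, :, None]) ** 2).sum()
  df_e = s * t * (r - 1)
  ms_e = ss_err / df_e
  for name, ss, df in (("subjects", ss_subj, s - 1),
                       ("groups", ss_grp, t - 1),
                       ("interaction", ss_inter, (s - 1) * (t - 1))):
      F = (ss / df) / ms_e
      print(name, F, f_dist.sf(F, df, df_e))  # about 63.1, 80.3 and 2.5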
¬<p values>╪29175 ¬
¬<multiple contrasts>╪180310 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Crossover|
If a group of subjects is exposed to two different treatments A and B then a
crossover trial would involve half of the subjects being exposed to A then B and
the other half to B then A. A washout period is allowed between the two
exposures and the subjects are randomly allocated to one of the two orders of
exposure. A simple crossover design ANOVA can be applied to these data. The
two times when the groups are exposed to the treatments are known as period 1
and period 2. This ANOVA tests for treatment effects, period effects and
treatment-period interaction. For further information please refer to Armitage
& Berry (ref 4).
EXAMPLE (from Armitage ref 4 p224):
The following data represent the number of dry nights out of 14 in two groups
of bedwetters. The first group were treated with drug X and then a placebo
and the second group were treated with the placebo then drug X. An acceptable
washout period was allowed between these two treatments.
Group I: Drug X Placebo Group II: Placebo Drug X
8 5 12 11
14 10 6 8
8 0 13 9
9 7 8 8
11 6 8 9
3 5 4 8
6 0 8 14
0 0 2 4
13 12 8 13
10 2 9 7
7 5 7 10
13 13 7 6
8 10
7 7
9 0
10 6
2 2
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select crossover from the analysis of
variance menu of the analysis section. When asked for baseline levels just
press Esc for none. Select a 95% confidence interval by pressing the enter
key when prompted by the confidence interval menu.
For this example:
Test for relative effectiveness of drug / placebo:
t = 3.526533 P = 0.0007 ***
Test for treatment effect:
diff 1 - diff 2 = 4.073529 SE = 1.2372
effect magnitude = 2.036765 95% CI = .7679056 to 3.305624
t = 3.292539 DF = 27 P = 0.0014 **
Test for period effect:
t = 1.271847 P = 0.1071
Test for treatment / period interaction:
t = -1.299673 P = 0.1024
Here the absence of a statistically significant period effect or treatment
period interaction enables us to quote the statistically highly significant
effect of drug vs placebo. With 95% confidence we can say that the true
population value for the magnitude of the treatment effect lies somewhere
between 0.77 and 3.31 extra dry nights each fortnight.
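The crossover tests reduce to t tests on within-subject differences; a
minimal sketch in Python (assuming numpy and scipy). Note that scipy returns
two sided p values, whereas the Arcus p values above appear to be one sided,
so halve scipy's p to compare:

  import numpy as np
  from scipy import stats

  g1_drug = np.array([8, 14, 8, 9, 11, 3, 6, 0, 13, 10, 7, 13, 8, 7, 9, 10,
                      2], dtype=float)
  g1_plac = np.array([5, 10, 0, 7, 6, 5, 0, 0, 12, 2, 5, 13, 10, 7, 0, 6,
                      2], dtype=float)
  g2_plac = np.array([12, 6, 13, 8, 8, 4, 8, 2, 8, 9, 7, 7], dtype=float)
  g2_drug = np.array([11, 8, 9, 8, 9, 8, 14, 4, 13, 7, 10, 6], dtype=float)

  d1 = g1_drug - g1_plac   # period 1 minus period 2, group I
  d2 = g2_plac - g2_drug   # period 1 minus period 2, group II

  # Treatment effect: two sample t test on the period differences; the
  # effect magnitude is half the difference between the mean differences.
  t, p = stats.ttest_ind(d1, d2)
  print(t, (d1.mean() - d2.mean()) / 2)   # about 3.29 and 2.04

  print(stats.ttest_ind(d1, -d2))         # period effect
  print(stats.ttest_ind(g1_drug + g1_plac, g2_plac + g2_drug))  # interaction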
¬<p values>╪29175 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Kruskal-Wallis| test
This is a method for comparing k independent random samples and can be used as
a nonparametric alternative to the one way ANOVA. In addition to independence
within the samples there must be mutual independence between the samples. The
data must also have been measured using a scale which is at least ordinal. If
the test is significant then you may conclude that at least one of the samples
tends to yield larger observations than at least one of the others. In the
presence of tied ranks the test statistic is given in adjusted and unadjusted
forms, (opinion varies concerning the handling of ties). Approximate
probability is evaluated from a chi-square distribution with k-1 degrees of
freedom. For small samples you may wish to refer to tables of the Kruskal-
Wallis test statistic but the chi-square approximation is highly satisfactory
in most cases. If this test achieves significance you are given the chance to
make multiple comparisons between the samples. You may choose the level of
significance for these comparisons but this is usually α = 0.05 which is the
default on pressing the enter key. All possible comparisons are made and the
probability of each presumed "non-difference" is indicated. For further
information about this method please refer to Conover (ref 6).
EXAMPLE (from Conover ref 6 p 231):
The following data represent corn yields per acre from four different fields
where different farming methods were used.
Method 1 Method 2 Method 3 Method 4
83 91 101 78
91 90 100 82
94 81 91 81
89 83 93 77
89 84 96 79
96 83 95 81
91 88 94 80
92 91 81
90 89
84
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select Kruskal-Wallis from the analysis
of variance menu of the analysis section. Enter the number of groups as four.
For this example:
Adjusted for ties: T = 25.62883 P < 0.0001 ***
Method 1 and Method 2 P = 0.0078 **
Method 1 and Method 3 P = 0.0044 **
Method 1 and Method 4 P < 0.0001 ***
Method 2 and Method 3 P < 0.0001 ***
Method 2 and Method 4 P = 0.0001 ***
Method 3 and Method 4 P < 0.0001 ***
From the overall T we see a statistically highly significant tendency for at
least one group to give higher values than at least one of the others.
Subsequent contrasts show a significant separation of all groups.
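A minimal cross-check in Python (assuming scipy, whose kruskal function
applies the tie adjustment):

  from scipy import stats

  m1 = [83, 91, 94, 89, 89, 96, 91, 92, 90, 84]
  m2 = [91, 90, 81, 83, 84, 83, 88, 91, 89]
  m3 = [101, 100, 91, 93, 96, 95, 94]
  m4 = [78, 82, 81, 77, 79, 81, 80, 81]

  T, p = stats.kruskal(m1, m2, m3, m4)
  print(T, p)   # about 25.63, p < 0.0001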
¬<p values>╪29175 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Friedman| Test
This method compares several related samples and can be used as a nonparametric
alternative to the two way ANOVA. It is assumed that the results within one
block do not influence the results within other blocks. If the test is
significant then at least one of the treatments tends to yield larger
observations than at least one of the other treatments. The power of this
method is low with small samples but it is the best method for nonparametric two
way analysis of variance with sample sizes above five. When the test is
significant Arcus allows you to make multiple comparisons between the individual
samples. These comparisons are performed automatically for all possible
contrasts and you are informed of the statistical significance of each contrast.
Please note that the overall test statistic is T2 as defined by Iman and
Davenport (1980) and this is tested against the F distribution. Older
literature advocates the use of T3 tested against the chi-square distribution
but this has been shown to be an inferior approach. For further information
please refer to Conover (ref 6).
EXAMPLE (from Conover ref 6 p 301):
The following data represent the rank preferences of twelve home owners for
four different types of grass planted in their gardens for a trial period.
They considered defined criteria before ranking each grass between 1 (best)
and 4 (worst).
Grass 1 Grass 2 Grass 3 Grass 4
4 3 2 1
4 2 3 1
3 1.5 1.5 4
3 1 2 4
4 2 1 3
2 2 2 4
1 3 2 4
2 4 1 3
3.5 1 2 3.5
4 1 3 2
4 2 3 1
3.5 1 2 3.5
To analyse these data in Arcus you must first prepare them in four worksheet
columns appropriately labelled. Then select Friedman from the analysis of
variance menu in the analysis section. Enter the number of groups as four.
For this example:
T2 = 3.192198 P = 0.0362 *
Grass 1 - Grass 2 P = 0.0149 *
Grass 1 - Grass 3 P = 0.0226 *
Grass 1 - Grass 4 P = 0.4834
Grass 2 - Grass 3 P = 0.8604
Grass 2 - Grass 4 P = 0.0717
Grass 3 - Grass 4 P = 0.1017
From the overall test statistic we can conclude that there is a statistically
significant tendency for at least one group to yield higher values than at
least one of the other groups. Considering the raw data and the contrast
results we see that grasses 2 and 3 are significantly preferred above grass 1
but that there is little to choose between 2 and 3.
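A minimal cross-check in Python (assuming scipy, which returns the
chi-square form of the statistic; the Iman and Davenport T2 follows from it
by a simple conversion):

  from scipy import stats
  from scipy.stats import f as f_dist

  g1 = [4, 4, 3, 3, 4, 2, 1, 2, 3.5, 4, 4, 3.5]
  g2 = [3, 2, 1.5, 1, 2, 2, 3, 4, 1, 1, 2, 1]
  g3 = [2, 3, 1.5, 2, 1, 2, 2, 1, 2, 3, 3, 2]
  g4 = [1, 1, 4, 4, 3, 4, 4, 3, 3.5, 2, 1, 3.5]

  n, k = 12, 4
  chi2, _ = stats.friedmanchisquare(g1, g2, g3, g4)
  t2 = (n - 1) * chi2 / (n * (k - 1) - chi2)         # Iman & Davenport T2
  print(t2, f_dist.sf(t2, k - 1, (n - 1) * (k - 1))) # about 3.19 and 0.036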
¬<p values>╪29175 ¬
¬<analysis of variance>╪158578 ¬
¬<reference list>╪310584 ¬
|Multiple Contrasts| and ANOVA
The multiple contrast or simultaneous inference situation arises when you want
to make pairwise comparisons between many groups after an analysis of variance.
When multiple comparisons are made you are in danger of type I error using t
tests alone, therefore, more conservative approaches are required. Arcus offers
you methods due to Scheffé, Newman-Keuls and gives Bonferroni's limitation with
t tests (ref 4, 13, 22).
With the Newman-Keuls method, means are first ordered in sequence then each
possible discrete comparison is made. The probability associated with the
resultant q values are then derived from the Studentized range.
For Scheffé's test all possible linear contrasts are also made automatically.
Please note that Scheffé's is the most conservative method of all.
In the presence of a control group some authors recommend Dunnett's method and
there are more powerful contrast methods for controls such as that due to the
late D. A. Williams. These are not presently offered by Arcus but you CAN use
one of the methods which are in the current version of Arcus; they will just be
a little more conservative.
I recommend the Newman-Keuls method for general use. It is the most soundly
justifiable approach for most multiple contrast situations. You will not find
it in many other stats packages because it is difficult to program, and for no
other reason (ref 4, 22).
This is a controversial area in statistics and you would be wise to seek the
advice of a statistician before you design your study. In general you should
design experiments so that you can avoid having to "dredge" groups of data for
differences, decide which contrasts you are interested in at the outset. If you
can identify contrasts at the design stage of an experiment then subsequent use
of t tests is justified provided the basic assumptions of the t test are met.
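The Newman-Keuls and Scheffé procedures take some effort to program by hand,
but Bonferroni's limitation is trivial to apply; a minimal sketch in Python
(assuming scipy; the samples are hypothetical, purely for illustration):

  from itertools import combinations
  from scipy import stats

  groups = {"A": [23, 25, 21, 26, 24],   # hypothetical samples
            "B": [28, 30, 27, 31, 29],
            "C": [24, 26, 25, 23, 27]}
  pairs = list(combinations(groups, 2))
  alpha = 0.05 / len(pairs)              # divide alpha by the number of tests
  for a, b in pairs:
      t, p = stats.ttest_ind(groups[a], groups[b])
      print(a, "vs", b, round(p, 4), "sig" if p < alpha else "ns")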
¬<analysis of variance>╪158578 ¬
|Survival Analysis|
¬<Kaplan-Meier>╪182964 ¬
¬<Simple life table>╪194318 ¬
¬<Log-rank and Wilcoxon>╪199215 ¬
¬<Wei-Lachin>╪206225 ¬
This section offers facilities for the description and comparison of survival
experience in different groups. Unlike other Arcus functions the survival
analysis section does not use separate variables for different groups. The
groups are indicated by a group variable which contains group identifiers, i.e.
for 2 groups you would have a column of 1's and 2's in the worksheet. Each
value in this column (variable) gives a group identity to its row with
respect to the time, death and censorship data in adjacent columns.
|Kaplan-Meier|
This provides the Kaplan-Meier product limit estimates of the survivor (S) and
cumulative hazard (H) functions. Results are displayed for one group at a time
and you have the option to save these results as worksheet variables. If you
choose to save results as worksheet variables then the results are extended to
include confidence intervals for the survivor and cumulative hazard functions.
The variance estimates are approximations based on Greenwood's formula, these
may differ slightly from results obtained using other packages. The confidence
interval for the survivor function is not a simple application of Greenwood's
variance approximation because this would give impossible results (< 0 or > 1)
at extremes of S. The confidence interval for S uses an asymptotic maximum
likelihood solution by the transformation recommended by Kalbfleisch and
Prentice (ref 25). You are also given the option to plot these functions.
Four different plots are given and certain distributions are indicated if
these plots display linearity (ref 24, 25). The plots and their associated
distributions are:
PLOT DISTRIBUTION INDICATED IF LINEAR
H vs Time Exponential, through the origin with slope lambda
ln(H) vs ln(Time) Weibull, intercept beta and slope ln(lambda)
Z(S) vs ln(Time) Log-normal
H/Time vs Time Linear hazard rate
DEFINITIONS:
Let survival time = time to event/failure (here = death)
S = survivor function
H = hazard function
S = (estimated probability of surviving day t for those alive at the start
    of day t) x (estimated % surviving up to day t) %
H = risk of death at time t
BEYOND ARCUS:
Arcus offers you the basic construction of survivor and hazard estimates with
their confidence intervals. If you want to go further and fit models to these
functions then you require specialist software. At this point most researchers
should seek statistical advice. You should aim to fit these models using a
maximum likelihood procedure. Beware, you might need to construct a novel
non-linear model for your data. The commonest model is exponential but Weibull,
log-normal, log-logistic and Gamma often appear.
If the hazard function is constant over time then a plot of the cumulative
hazard function vs time will be linear through the origin with slope lambda.
If this is true then you have the
useful relationship Probability(survival > t) = exp(-lambda * t). This eases
the calculation of relative risk from the ratio of hazard functions at time t
on two survival curves. When the hazard function depends on time then you can
usually calculate relative risk after fitting Cox's proportional hazards model.
This model assumes that for each group the hazard functions are proportional
at each time, it does not assume any particular distribution function for the
hazard function. Proportional hazards modelling can be very useful, however,
most researchers should seek statistical guidance with this.
SAS includes some good routines for modelling survival data but you might
require Genstat, GLIM or MLP for more exploratory work.
EXAMPLE (from Kalbfleisch & Prentice ref 25, p 14):
Death from vaginal cancer after exposure to the carcinogen DMBA was measured
in two groups of rats. Group 1 had a different DMBA pre-treatment régime to
group 2. The time from pre-treatment to death is recorded. If a rat was still
living at the end of the experiment or it had died from a different cause then
that time is considered "censored". A censored observation is given the value
0 in the death/censorship variable to indicate a "non-event".
Group 1: 143, 164, 188, 188, 190, 192, 206, 209, 213, 216, 220, 227, 230,
234, 246, 265, 304, 216*, 244*
Group 2: 142, 156, 163, 198, 205, 232, 232, 233, 233, 233, 233, 239, 240,
261, 280, 280, 296, 296, 323, 204*, 344*
* = censored data
To analyse these data in Arcus you must first prepare them in three worksheet
columns appropriately labelled:
Group Time Death/Censorship
2 142 1
1 143 1
2 156 1
2 163 1
1 164 1
1 188 1
1 188 1
1 190 1
1 192 1
2 198 1
2 204 0
2 205 1
1 206 1
1 209 1
1 213 1
1 216 0
1 216 1
1 220 1
1 227 1
1 230 1
2 232 1
2 232 1
2 233 1
2 233 1
2 233 1
2 233 1
1 234 1
2 239 1
2 240 1
1 244 0
1 246 1
2 261 1
1 265 1
2 280 1
2 280 1
2 296 1
2 296 1
1 304 1
2 323 1
2 344 0
Then select the Kaplan-Meier function from the survival analysis menu of the
analysis section. Select Y when you are asked whether or not you want to save
various statistics to the worksheet. Select a 95% confidence interval by
pressing enter when prompted with the confidence interval menu. Select Y when
you are prompted about displaying plots.
For Group 1:
Here are the product limit estimates of survival and hazard to the times
observed in the experiment:
Time At Risk Dead Censored S Var S H Var H
143 19 1 0 0.94737 0.00262 0.05407 0.00292
164 18 1 0 0.89474 0.00496 0.11123 0.00619
188 17 2 0 0.78947 0.00875 0.23639 0.01404
190 15 1 0 0.73684 0.01021 0.30538 0.0188
192 14 1 0 0.68421 0.01137 0.37949 0.02429
206 13 1 0 0.63158 0.01225 0.45953 0.0307
209 12 1 0 0.57895 0.01283 0.54654 0.03828
213 11 1 0 0.52632 0.01312 0.64185 0.04737
216 10 1 1 0.47368 0.01312 0.74721 0.05848
220 8 1 0 0.41447 0.01311 0.88075 0.07634
227 7 1 0 0.35526 0.01264 1.0349 0.10015
230 6 1 0 0.29605 0.0117 1.21722 0.13348
234 5 1 0 0.23684 0.01029 1.44036 0.18348
244 4 0 1 0.23684 0.01029 1.44036 0.18348
246 3 1 0 0.15789 0.00873 1.84583 0.35015
265 2 1 0 0.07895 0.0053 2.53897 0.85015
304 1 1 0 0 0 ∞ 0
And with 95% confidence interval for S...
Time At Risk Survivor (S) 95% LCI S 95% UCI S
143 19 .9473684 .6811868 .9924147
164 18 .8947369 .6407944 .9725854
188 17 .7894737 .5319126 .9152861
190 15 .7368422 .4789329 .8810194
192 14 .6842106 .4279407 .8439419
206 13 .631579 .3789929 .804409
209 12 .5789474 .3320811 .76264
213 11 .5263159 .2872013 .7187639
216 10 .4736843 .2443767 .6728407
220 8 .4144737 .1961606 .6211132
227 7 .3552632 .1519129 .5664639
230 6 .2960527 .1116839 .5087005
234 5 .2368421 7.577927E-02 .4474698
244 4 .2368421 7.577927E-02 .4474698
246 3 .1578947 3.143191E-02 .3735425
265 2 7.894737E-02 5.665417E-03 .2876329
304 1 0 0 0
Below is the classical "survival plot" showing how survival declines with time.
If you want a high resolution plot of this then feed the data saved to the
worksheet through the survival plot function of the pictorial statistics menu.
Survivor
1.00+
│B
│A B
│ BA
│ B .
0.75+ A B
│ A
│ A
│ A B
│ A
0.50+ A
│ A
│ A B
│ A B
│ B
0.25+ A B
│ A .
│ A B
│
│ A B
0.00+ A B .
/+────────-+────────-+────────-+────────-+────────-+────────-+
140 180 220 260 300 340 380
Times
The approximate linearity of the log cumulative hazard (H) vs log time plot
below suggests a Weibull distribution of survival times.
Log Hazard
1.70+
│
│
│ B .
│ A B
0.45+ A B
│ A . B
│ AA B
│ AA B
│ A
-0.80+ AA B
│ A
│ A
│ A B
│ B .
-2.05+ B
│ A
│ B
│
│A
-3.30+B
/+────────-+────────-+────────-+────────-+────────-+────────-+
4.95 5.10 5.25 5.40 5.55 5.70 5.85
Log Times
At this point you may want to run a formal hypothesis test to see if
there is any statistical evidence for two or more survival curves being
different. This can be achieved using sensitive parametric methods if you have
fitted a particular distribution curve to your data. More often you would use
the ¬Log-rank and Wilcoxon╪199215 ¬ tests which do not assume any particular
distribution of the survivor function.
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Simple Life Table|
This function provides a simple life table which displays the survival
experience of a group of individuals or cohort; it is much like the table
originally proposed by Berkson and Gage (ref 4, 5, 24, 25). The confidence
interval for lx is not a simple application of the estimated variance. Instead
it uses a maximum likelihood solution from an asymptotic distribution via the
transformation of lx suggested by Kalbfleisch and Prentice (ref 25). This
treatment of lx avoids impossible values (i.e. >1 or <0).
DEFINITIONS:
INTERVAL For a full life table this is ages in single years.
For an abridged life table this is ages in groups.
For a Berkson and Gage survival table this is the survival times
in intervals.
DEATHS Number of individuals who die in the interval.
W'DRAWN Number of individuals withdrawn or lost to follow up in the
interval.
AT RISK Number of individuals alive at the start of the interval.
N'x Adjusted number at risk (half of withdrawals of current interval
subtracted).
q Probability that an individual who survived the last interval will
die in the current interval.
p Probability that an individual who survived the last interval will
survive the current interval.
lx Probability of an individual surviving beyond the current interval.
Proportion of survivors after the current interval.
Life table survival rate.
Var(lx) Estimated variance of lx.
X% LCI lx Lower x% confidence interval for lx.
X% UCI lx Upper x% confidence interval for lx.
EXAMPLE (from Armitage ref 4 p 425):
The following data represent the survival of 374 patients who had one type of
surgery for a particular malignancy:
Years since operation Died in this interval Lost to follow up
1 90 0
2 76 0
3 51 0
4 25 12
5 20 5
6 7 9
7 4 9
8 1 3
9 3 5
10 2 5
To analyse these data in Arcus you must first prepare them in three worksheet
columns appropriately labelled. Then select the simple life table from the
survival analysis menu of the analysis section. Enter the number at the start
as 374. Select a 95% confidence interval by pressing enter when prompted by the
confidence interval menu.
For this example:
Interval Deaths W'drawn At Risk N'x q p
0- 90 0 374 374 0.2406417 0.7593583
1- 76 0 284 284 0.2676056 0.7323943
2- 51 0 208 208 0.2451923 0.7548077
3- 25 12 157 151 0.1655629 0.8344371
4- 20 5 120 117.5 0.1702128 0.8297873
5- 7 9 95 90.5 0.07734807 0.9226519
6- 4 9 79 74.5 0.05369128 0.9463087
7- 1 3 66 64.5 0.01550388 0.9844961
8- 3 5 62 59.5 0.05042017 0.9495798
9- 2 5 54 51.5 0.03883495 0.9611651
10- - - 47 - - -
Interval p lx Var(lx) 95% LCI lx 95% UCI lx
0- 0.7593583 1 - - -
1- 0.7323943 0.7593583 0.00048859 0.7127129 0.7995125
2- 0.7548077 0.5561497 0.00066002 0.5042839 0.6048234
3- 0.8344371 0.4197861 0.00065125 0.3694556 0.4692234
4- 0.8297873 0.3502851 0.00061468 0.3020018 0.3988916
5- 0.9226519 0.2906621 0.00057073 0.2447156 0.33805
6- 0.9463087 0.26818 0.00055247 0.2232208 0.3150406
7- 0.9844961 0.253781 0.00054379 0.2093514 0.3004384
8- 0.9495798 0.2498464 0.0005423 0.2055291 0.2964883
9- 0.9611651 0.2372491 0.00053922 0.1932333 0.2839237
10- - 0.2280356 0.00053895 0.1932333 0.2839237
Thus we can conclude with 95% confidence that the true population survival rate
5 years after this operation lies somewhere between 24.5% and 33.8% for
patients who present with this malignancy.
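As a check on the arithmetic, the q, p, lx and Var(lx) columns above can be
reproduced with a few lines of Python. This is only a sketch of the standard
actuarial formulae (half of the withdrawals subtracted from the number at
risk, Greenwood's variance); it is not the Arcus source, and it does not
attempt the Kalbfleisch-Prentice transformed confidence limits:

deaths    = [90, 76, 51, 25, 20, 7, 4, 1, 3, 2]
withdrawn = [ 0,  0,  0, 12,  5, 9, 9, 3, 5, 5]
at_risk = 374
lx, gw = 1.0, 0.0                      # survival estimate and Greenwood sum
for i, (d, w) in enumerate(zip(deaths, withdrawn)):
    nx = at_risk - w / 2.0             # N'x: adjusted number at risk
    q = d / nx                         # probability of dying in the interval
    p = 1.0 - q                        # probability of surviving it
    print(f"{i}-  q={q:.7f}  p={p:.7f}  lx={lx:.7f}  Var(lx)={lx*lx*gw:.8f}")
    gw += q / (nx * p)                 # Greenwood accumulation for Var(lx)
    lx *= p
    at_risk -= d + w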
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Log-Rank and Wilcoxon|
These are two methods for comparing two or more survival curves. These methods
do not make any assumptions about the distributions of the survival estimates
which comprise the curves. The null hypothesis that the risk of death is the
same in all groups is tested. Peto's log-rank test is generally the most
appropriate method but the modified Wilcoxon test is more sensitive when the
ratio of hazards is higher at early survival times than at late ones. An
optional variable, strata, allows you to sub-classify the groups specified
in the group identifier variable and to test the significance of this
sub-classification (ref 4, 24, 25).
EXAMPLE (from Armitage ref 4 p 431): The following data represent the survival
in days since entry to the trial of patients with diffuse histiocytic lymphoma.
Two different groups of patients, those with stage III and those with stage IV
disease, are compared.
Stage 3: 6, 19, 32, 42, 42, 43*, 94, 126*, 169*, 207, 211*, 227*, 253, 255*,
270*, 310*, 316*, 335*, 346*
Stage 4: 4, 6, 10, 11, 11, 11, 13, 17, 20, 20, 21, 22, 24, 24, 29, 30, 30,
31, 33, 34, 35, 39, 40, 41*, 43*, 45, 46, 50, 56, 61*, 61*, 63, 68,
82, 85, 88, 89, 90, 93, 104, 110, 134, 137, 160*, 169, 171, 173,
175, 184, 201, 222, 235*, 247*, 260*, 284*, 290*, 291*, 302*, 304*,
341*, 345*
* = censored data (patient still alive or died from an unrelated cause)
To analyse these data in Arcus you must first prepare them in three worksheet
columns as shown below:
group time censor
1 6 1
1 19 1
1 32 1
1 42 1
1 42 1
1 43 0
1 94 1
1 126 0
1 169 0
1 207 1
1 211 0
1 227 0
1 253 1
1 255 0
1 270 0
1 310 0
1 316 0
1 335 0
1 346 0
2 4 1
2 6 1
2 10 1
2 11 1
2 11 1
2 11 1
2 13 1
2 17 1
2 20 1
2 20 1
2 21 1
2 22 1
2 24 1
2 24 1
2 29 1
2 30 1
2 30 1
2 31 1
2 33 1
2 34 1
2 35 1
2 39 1
2 40 1
2 41 0
2 43 0
2 45 1
2 46 1
2 50 1
2 56 1
2 61 0
2 61 0
2 63 1
2 68 1
2 82 1
2 85 1
2 88 1
2 89 1
2 90 1
2 93 1
2 104 1
2 110 1
2 134 1
2 137 1
2 160 0
2 169 1
2 171 1
2 173 1
2 175 1
2 184 1
2 201 1
2 222 1
2 235 0
2 247 0
2 260 0
2 284 0
2 290 0
2 291 0
2 302 0
2 304 0
2 341 0
2 345 0
Next select the Log-rank and Wilcoxon function from the survival analysis
menu of the analysis section.
For this example:
relative death rate for stage 3 = .4794143
relative death rate for stage 4 = 1.232816
Log-rank test
Chi-square for equivalence of death rates = 6.70971 P = 0.0096 **
Generalised Wilcoxon test
Chi-square for equivalence of death rates = 3.936735 P = 0.0472 *
You can see that both tests have demonstrated a statistically significant
difference in survival experience between stage 3 and stage 4 patients in
this study.
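For those who want to see what the log-rank test is doing, the sketch below
(plain Python, not the Arcus algorithm) computes the observed and expected
deaths for each group in the usual log-rank way; the relative death rate
quoted above is O/E. Arcus uses the more exact variance formulae of ref 40,
so the simple chi-square sum((O-E)²/E) will be close to, but not identical
with, the value above:

def logrank_oe(times, events, groups):
    # Observed and expected deaths per group using log-rank expectations.
    labels = sorted(set(groups))
    O = {g: 0 for g in labels}
    E = {g: 0.0 for g in labels}
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        d = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        n = sum(1 for tt in times if tt >= t)        # total still at risk
        for g in labels:
            ng = sum(1 for tt, gg in zip(times, groups)
                     if tt >= t and gg == g)         # at risk in group g
            E[g] += d * ng / n                       # expected deaths at t
    for tt, e, g in zip(times, events, groups):
        O[g] += e
    return O, E

# With times, events (1 = death) and groups read from the worksheet above:
# O, E = logrank_oe(times, events, groups)
# rates = {g: O[g] / E[g] for g in O}                # relative death rates
# chi2 = sum((O[g] - E[g]) ** 2 / E[g] for g in O)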
Stratified example: (from Peto et al. ref 40)
Group Identifier Trial Times Censorship (Strata, optional)
1 8 1 (event = death) 1 (renal impairment)
1 8 1 2 (no renal impairment)
2 13 1 1
2 18 1 1
2 23 1 1
1 52 1 1
1 63 1 1
1 63 1 1
2 70 1 2
2 70 1 2
2 180 1 2
2 195 1 2
2 210 1 2
1 220 1 2
1 365 0 (lost to f.u.) 2
2 632 1 2
2 700 1 2
1 852 0 (surviving) 2
2 1296 1 2
1 1296 0 2
1 1328 0 2
1 1460 0 2
1 1976 0 2
2 1990 0 2
2 2240 0 2
The table above shows you how to prepare data for a stratified log-rank test
in Arcus. This example is worked through in the second of two classic papers
by Richard Peto and colleagues (ref 39, 40). If you want to understand survival
analysis then I strongly advise you to read these two papers. Please note that
Arcus uses the more exact variance formulae mentioned in the statistical notes
section at the end of ref 40.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Wei-Lachin|
This provides a two sample distribution free analysis for the comparison of two
multivariate distributions of survival / time-to-event data which may be
incomplete / censored. The method uses the random censorship model to apply
generalisations of the log-rank test and the Gehan generalised Wilcoxon test.
(ref A21, 26). Arcus asks you for a group identifier variable which should be
a vector of 1's and 2's representing the two groups. You then identify n pairs
of time-to-event and censorship variables for the n repeat times which you have
specified. Censored data are coded as 0 and 1 represents uncensored data in
the censorship variable. Repeat times may represent separate factors or the
observation of the same factor repeated on n occasions. For example, time to
develop symptoms could be analysed for n different symptoms in a group of
patients treated with drug x and compared with a group of patients not treated
with drug x. For further details please refer to the excellent paper by
Robert Makuch et al. from which this Arcus function was developed (ref A21).
EXAMPLE (from Makuch ref A21): The following data represent the times in days
it took in vitro cultures of lymphocytes to reach a level of p24 antigen
expression. The cultures were taken from patients infected with HIV-1 who had
advanced AIDS or AIDS related complex. The idea was that patients whose
cultures took a short time to express p24 antigen had a greater load of HIV-1.
The two groups represented patients on two different treatments. The culture
was run for 30 days and specimens which remained negative or which became
contaminated were called censored (=0). The tests were run over four 30 day
periods:
Treatment Time 1 Cens 1 Time 2 Cens 2 Time 3 Cens 3 Time 4 Cens 4
Group
1 8 1 0 0 25 0 21 1
1 6 1 4 1 5 1 5 1
1 6 1 5 1 28 0 18 1
1 14 0 35 0 23 1 19 0
1 7 1 0 0 13 1 0 0
1 5 1 4 1 27 1 8 1
1 5 1 21 0 6 1 14 1
1 6 1 10 1 14 1 18 1
1 7 1 4 1 15 1 8 1
1 6 1 5 1 5 1 5 1
1 4 1 5 1 6 1 3 1
1 5 1 4 1 7 1 5 1
1 21 0 5 1 0 0 6 1
1 13 1 27 0 21 0 8 1
1 4 1 27 0 7 1 6 1
1 6 1 3 1 7 1 8 1
1 6 1 0 0 5 1 5 1
1 6 1 0 0 4 1 6 1
1 7 1 9 1 6 1 7 1
1 8 1 15 1 8 1 0 0
1 18 0 27 0 18 0 9 1
1 16 1 14 1 14 1 6 1
1 15 1 9 1 12 1 12 1
2 4 1 5 1 4 1 3 1
2 8 1 22 1 25 0 0 0
2 6 1 6 1 8 1 5 1
2 7 1 10 1 10 1 18 1
2 5 1 14 1 17 0 6 1
2 3 1 5 1 8 1 6 1
2 6 1 11 1 6 1 13 1
2 6 1 0 0 15 1 7 1
2 6 1 12 1 19 1 8 1
2 6 1 25 0 0 0 22 0
2 4 1 7 1 5 1 7 1
2 5 1 7 1 4 1 6 1
2 3 1 9 1 7 1 6 1
2 9 1 17 1 0 0 21 0
2 6 1 4 1 8 1 14 1
2 5 1 5 1 7 1 16 0
2 12 1 18 0 14 1 0 0
2 9 1 11 1 15 1 18 0
2 6 1 5 1 9 1 0 0
2 18 0 8 1 10 1 13 1
2 4 1 4 1 5 0 10 1
2 3 1 10 1 0 1 21 0
2 8 1 7 1 10 1 12 1
2 3 1 6 1 7 1 9 1
To analyse these data in Arcus you must first prepare them in 9 worksheet
columns as shown above. Then select the Wei-Lachin function from the survival
analysis menu of the analysis section. Enter number of repeat times as 4.
For this example:
Univariate generalised Wilcoxon tests:
repeat time = 1
chi-square = 3.588261 P = 0.0582
repeat time = 2
chi-square = .1071885 P = 0.7434
repeat time = 3
chi-square = .2164523 P = 0.6418
repeat time = 4
chi-square = 1.996144 P = 0.1577
Multivariate generalised Wilcoxon test:
chi squared omnibus statistic = 9.242916 P = 0.0553
stochastic ordering chi-square = 9.598206E-02 P = 0.7567
Univariate log-rank tests:
repeat time = 1
chi-square = 3.344057 P = 0.0674
repeat time = 2
chi-square = .5345362 P = 0.4647
repeat time = 3
chi-square = .9179572 P = 0.3380
repeat time = 4
chi-square = 2.675657 P = 0.1019
Multivariate log-rank test:
chi squared omnibus statistic = 9.52966 P = 0.0491 *
stochastic ordering chi-square = .4743826 P = 0.4910
Here the multivariate log-rank test has revealed a statistically significant
difference between the treatment groups which was not revealed by any of the
individual univariate tests. For more detailed discussion of each result
parameter please see Wei and Lachin's original paper (ref 26).
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Instant Functions| (Non-Worksheet oriented analysis)
¬<Distributions>╪213522 ¬
¬<Chi-square tests>╪222593 ¬
¬<Exact tests>╪242962 ¬
¬<Proportions>╪262904 ¬
¬<Sample Size>╪256010 ¬
¬<Randomisation>╪252007 ¬
¬<Miscellaneous>╪269298 ¬
These functions are referred to as instant because they do not require columns
of numbers to have been prepared in advance using the Arcus worksheet. You are
prompted for the relevant data within the function.
Statistical Probability |Distributions|
This section deals with the commonly used statistical probability distributions.
Robust, reliable algorithms have been employed to provide a high level of
accuracy, thus most tail areas are given to six decimal places. For practical
purposes the p values given with hypothesis tests throughout Arcus are displayed
to four decimal places.
¬<Normal>╪218237 ¬
¬<Chi-square>╪218665 ¬
¬<Student's t>╪219206 ¬
¬<F (variance ratio)>╪219707 ¬
¬<Studentized range Q>╪220173 ¬
¬<Spearman's rho>╪221715 ¬
¬<Kendall's tau>╪222143 ¬
¬<binomial>╪220751 ¬
¬<Poisson>╪221217 ¬
PROBABILITY DISTRIBUTIONS
-------------------------
Probability exists as a concept to help us predict the chance of something
happening (an outcome) based on observations of this outcome in the past.
In mathematical language, this outcome is described in terms of a random
variable. The random variable can take on different values which represent
different outcomes, e.g. blood pressure. This sort of random variable can be
thought of in infinitely small units of measurement where the steps between
the units are so small that they become continuous; this is a continuous
random variable. The other kind of random variable is called discrete.
Discrete random variables take on discrete outcomes such as the number of
times an asthmatic patient has been admitted to hospital with an acute
exacerbation. If you consider an outcome measured in many different
individuals in a population then you are starting to build up a model of this
outcome within that population. If you then plot all of the values of this
outcome on a histogram you might find a particular shape emerging every time
you take a large random sample from this population. With a continuous random
variable you can draw a curve around the histogram because it is possible to
have values in between any that are measured. With a discrete variable,
however, there may only be a few possible outcomes so your histogram will have
wide bars with definite steps between them. This is like the difference
between a digital signal (steps) and an analogue signal (curves).
Now comes the all important linking concept, probability distribution. We have
discussed how the different values of an outcome can be plotted on a histogram
with some values occurring more frequently than others. Thus the commonly
occurring values have a higher probability of being observed when you take a
random sample of that population.
DEFINITION: A probability distribution of a random variable is a table, graph
or mathematical expression giving the probabilities with which the random
variable takes different values.
Putting numbers to this concept involves more thought about populations.
Think of a graph of probability (p) plotted against the value of outcome (x).
A probability distribution would include all possible values for x. The sum
of p for all possible values of x is defined as 1. For discrete variables
this is literally a simple summation but for continuous variables the number of
possible values of x is infinite so we use integration to estimate the area
under the curve. This area is of course 1 for the total curve. Now consider
one value of x. You can use the probability distribution for x to estimate the
chance of observing that x at random in the population. For discrete
distributions we do literally calculate p but for continuous distributions we
consider a partial area under the curve or probability density function which
represents the probability that x lies between 2 specified values.
Most of the time you will be dealing with outcomes which are values of a
statistic
calculated as a test of some hypothesis. The so called test statistic can
usually be compared with one of the standard probability distributions. The
p value derived from this test statistic is then used to accept or refute the
test hypothesis with an accepted level of certainty. This sort of result often
gives a false sense of security as it says nothing about the assumptions of
your test. The use of confidence intervals gives a more realistic
representation of a test result, but it most certainly does NOT rescue a test
used with invalid assumptions. Please read the help text regarding assumptions
when you are
using any of the hypothesis tests in Arcus.
Discrete distributions: eg Binomial, Poisson
Continuous distributions: eg Normal, Chi-square, Student's t, F
If you need more information about probability and sampling theory then please
consult one of the introductory or core texts listed in the reference section.
|Normal| (Gaussian)
The normal distribution is the most important continuous probability
distribution. It was first described by De Moivre in 1733 and subsequently by
the German mathematician C. F. Gauss (1777 - 1855). Arcus gives you the tail
areas and percentage points for this function. Please note that the upper and
lower tails are not simply 1.0 minus the other. (ref A3, A4)
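If you want to reproduce a tail area outside Arcus, the upper tail of the
standard normal distribution can be computed from the complementary error
function. A minimal Python sketch:

from math import erfc, sqrt

def upper_tail(z):
    # P(Z > z) for a standard normal variable
    return 0.5 * erfc(z / sqrt(2.0))

print(upper_tail(1.96))    # about 0.025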
¬<Distributions>╪213522 ¬
|Chi-square|
The chi-square statistic is related to the sum of squares of a number of
standard normal variables and is associated with a positively (right) skewed
distribution which approaches symmetry as the sample size increases. Arcus can
be used to calculate the probability associated with a chi-square random
variable with given degrees of freedom and to calculate the percentage points
of this distribution (ref A5). A reliable approach to the incomplete gamma
integral is used (ref A16).
¬<Distributions>╪213522 ¬
|Student's t|
t represents a family of distributions which are shaped by nu degrees of
freedom. When nu is infinite t becomes a normal distribution. This family of
distributions is associated with W. S. Gosset who, at the turn of the century,
published his work under the pseudonym Student. Arcus uses the relationship
between Student's t and Snedecor's f to calculate the tail areas and percentage
points of t distributions for given degrees of freedom.
¬<Distributions>╪213522 ¬
|F (variance ratio)|
Snedecor's f describes the distribution of variance estimates of two samples,
each from a normal distribution. The size of each sample is reflected in the
degrees of freedom nu1 and nu2. Arcus calculates tail areas and percentage
points for given numerator (nu1) and denominator (nu2) degrees of freedom.
Reliable approaches to the beta function are used in these calculations
(ref A7, A8, A9, A10).
¬<Distributions>╪213522 ¬
|Studentized Range Q|
The Studentized range, Q, is the range of means divided by the estimated
standard error for a given group of samples. This is often used in multiple
comparison / simultaneous inference methods which accompany analyses of
variance. Arcus calculates tail areas and percentage points for a given number
of samples and sample sizes. Please note that these calculations are highly
complex and will take longer than any of the other distribution functions
particularly with large numbers of samples (ref A11, A12).
¬<Distributions>╪213522 ¬
|Binomial|
The binomial distribution describes a random variable which is the number of
successes in n trials. There must be only two outcomes to the trial, success
or failure. Each of the n repetitions of this trial must also be completely
independent. Arcus calculates cumulative probabilities for (>=, <=, =) r
successes in n trials. Confidence intervals for binomial proportions are given
with the Arcus sign test.
¬<Distributions>╪213522 ¬
|Poisson|
The Poisson distribution represents the probabilities of r events occurring
independently and at random in certain defined circumstances with mean µ.
This approximates a binomial distribution when the number of trials is large
and the probability of success on each trial is small. Arcus calculates
cumulative probabilities that (<=, >=, =) r random events are contained in an
interval when the average number of such events per interval is µ.
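As an illustration of the calculation (a sketch, not the Arcus routine), the
cumulative probability of r or fewer events when the mean is µ is a simple
sum of Poisson terms:

from math import exp

def poisson_cdf(r, mu):
    # P(X <= r) for a Poisson variable with mean mu
    term = total = exp(-mu)              # the k = 0 term
    for k in range(1, r + 1):
        term *= mu / k                   # recurrence avoids large factorials
        total += term
    return total

print(poisson_cdf(3, 2.5))    # about 0.7576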
¬<Distributions>╪213522 ¬
|Spearman's Rho| / Hotelling-Pabst
Given a value for the Hotelling-Pabst test statistic (T) or Spearman's rho this
function calculates the probability of obtaining a value greater than or equal
to T. Upper tail probabilities are calculated using a recurrence method when
n < 7 and the Edgeworth series expansion when n >= 7. The maximum error for any
probability is 0.0004 (ref A13).
¬<Distributions>╪213522 ¬
|Kendall's Tau|
Given a value for the test statistic (S) associated with Kendall's tau this
function calculates the probability of obtaining a value greater than or equal
to S for a given sample size. Upper tail probabilities are calculated using a
recurrence method when n < 9 and an improved Edgeworth series expansion when
n >= 9 (ref A14). The two samples are assumed to have been ranked without ties.
¬<Distributions>╪213522 ¬
|Chi-square Tests|
¬<2 by 2>╪223812 ¬
¬<2 by k>╪230817 ¬
¬<r by c>╪227666 ¬
¬<Matched pairs (McNemar, Liddell)>╪233702 ¬
¬<Mantel-Haenszel>╪235832 ¬
¬<Woolf>╪239133 ¬
Chi-square tests compare observed and expected frequencies of individuals
grouped by different categories. Arcus applies the basic chi-square analysis
to a number of different contingency table designs. The larger the resultant
chi-square statistic (for given degrees of freedom) the more likely there is
to be a significant difference between observed and expected frequencies. A null
hypothesis that there is no difference between the populations from which you
quantify observed and expected frequencies is tested by comparing the calculated
chi-square statistic with percentage points of the chi-square distribution.
This is valid provided that the numbers are not too small; in general any
expected frequency should be greater than five.
|Haldane| correction
This is a method used to avoid error in the calculation of some of the chi-
square tests in Arcus. It involves adding 0.5 to all of the cells of a
contingency table if any of the cell expectations would cause a division by
zero error.
|2 by 2| contingency table chi-square test
The two by two or fourfold contingency table is commonly used to compare two
proportions. The rows represent two classifications of one variable (e.g.
infection/no infection) and the columns represent two classifications of another
variable (e.g. antiseptic wash/no antiseptic). These classifications must be
independent. Paired results (e.g. same group of individuals before and after
antiseptic wash) should be analysed using a test for ¬matched pairs╪233702 ¬.
Fisher's exact test should be used as an alternative to the fourfold chi-square
test if the total number is less than twenty or any of the expected frequencies
are less than five. In practical terms, however, there is little point in using
the fourfold chi-square test when Arcus provides you with a Fisher's exact test
which can cope with reasonably large numbers. In the fourfold chi-square test
you are advised to use the Yates' corrected value as this improves the
approximation of your discrete sample chi-square statistic to a continuous
chi-square distribution (ref 4).
The odds ratio of this 2 by 2 table is given and the associated approximate
confidence interval (CI) is calculated using two different methods. The CI
using the logit method for large samples is given first followed by the CI
using Cornfield's method (ref 9, 11). The latter is the most reliable method
but the logit method might be more acceptable if a convergent solution has not
been achieved with Cornfield's method.
EXAMPLE (from Armitage ref 4 p 126):
The following represent mortality data for two groups of patients receiving
different treatments, A and B.
Outcome
Dead Alive
Treatment / Exposure A 41 216
B 64 180
To analyse these data in Arcus you must select the 2 by 2 contingency table
from the chi-square sub-menu of the instant functions menu in the analysis
section. Select a 95% confidence interval by pressing the enter key when
prompted by the confidence interval menu. Enter the frequencies into the
contingency table on screen as shown above.
For this example:
Observed values and totals:
╔════════════════╤════════════════╤════════════════╗
║ 41 │ 216 │ 257 ║
╟────────────────┼────────────────┼────────────────╢
║ 64 │ 180 │ 244 ║
╠════════════════╪════════════════╪════════════════╣
║ 105 │ 396 │ 501 ║
╚════════════════╧════════════════╧════════════════╝
Expected values:
╔════════════════╤════════════════╗
║ 53.86227 │ 203.1377 ║
╟────────────────┼────────────────╢
║ 51.13773 │ 192.8623 ║
╚════════════════╧════════════════╝
Yates-corrected Chi² = 7.370595 P = 0.0066
Coefficient of contingency: V = -0.126198
Using Cornfield's Method for a 95% CI:
Odds ratio (after ¬Haldane╪223546 ¬ correction) = 0.536423
Lower limit: 0.335953
Upper limit: 0.847064
Here we can see a statistically significant relationship between treatment
and mortality. The strength of that relationship is reflected by the
coefficient of contingency. The odds ratio tells us that the odds in favour of
dying after treatment A are about half of the odds of dying after treatment B.
With 95% confidence we put the true population value for this ratio of odds
somewhere between 0.34 and 0.85. If you need to phrase the arguments with
odds ratios the other way around then just quote the reciprocals, i.e. here
we would say that the odds of dying after treatment B are 1.86 times greater
than after treatment A.
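To see where these figures come from, here is a short Python sketch (not
Arcus code) of the Yates-corrected chi-square and the large-sample logit
confidence interval; the Haldane correction adds 0.5 to each cell before the
odds ratio is formed. Cornfield's interval is iterative and is not attempted
here:

from math import exp, log, sqrt

a, b, c, d = 41, 216, 64, 180
n = a + b + c + d
# Yates-corrected chi-square for the fourfold table
chi2 = ((abs(a * d - b * c) - n / 2.0) ** 2 * n /
        ((a + b) * (c + d) * (a + c) * (b + d)))
# Odds ratio after Haldane correction, with the large-sample logit interval
a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
oddsr = a * d / (b * c)
se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = exp(log(oddsr) - 1.96 * se), exp(log(oddsr) + 1.96 * se)
print(chi2, oddsr, lo, hi)   # 7.3706, 0.5364 and logit limits near 0.35-0.83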
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|R by C| contingency table chi-square test
The r by c chi-square test extends the chi-square method to any number of
independent categories expressed as r rows and c columns of a contingency
table. The overall test indicates the degree of independence between the
variables which make up the table. An analysis of trend indicates how much of
the difference between the mean scores for the columns can be accounted for by
linear trend. Armitage (ref 4) quotes an example where extent of grief of
mothers suffering a perinatal death, graded I to IV, is compared with the
degree of support received by these women. In this example the overall
statistic is non-significant but a significant trend is demonstrated. The
largest table for a display of individual results is 8 columns by 10 rows but
general results are given for larger tables, with the maximum table size being
limited only by your computer's memory. Observed values, expected values and
totals are given for the table when c <= 8 and r <= 10.
EXAMPLE (from Armitage ref 4 p 378):
The following data (as above) describe the state of grief of 66 mothers who
had suffered a perinatal death. The table relates this to the amount of
support
given to these women:
Support
Good Adequate Poor
Grief State I 17 9 8
II 6 5 1
III 3 5 4
IV 1 2 5
To analyse these data in Arcus you must select r by c from the chi-square test
menu of the instant functions menu in the analysis section. Press N when asked
about percentages. Choose a 95% confidence interval by pressing the enter key
when prompted by the confidence interval menu. Then select the number of rows
as 4 and the number of columns as 3. You then enter the above data as
directed by the screen.
For this example:
Observed 17 9 8 34
Expected 13.91 10.82 9.27
DChi² 0.69 0.31 0.17
Observed 6 5 1 12
Expected 4.91 3.82 3.27
DChi² 0.24 0.37 1.58
Observed 3 5 4 12
Expected 4.91 3.82 3.27
DChi² 0.74 0.37 0.16
Observed 1 2 5 8
Expected 3.27 2.55 2.18
DChi² 1.58 0.12 3.64
Totals: 27 21 18 66
TOTAL number of cells = 12
WARNING: 9 out of 12 cells have 1 <= EXPECTATION < 5
Overall chi-square = 9.9588 P = 0.1264
Chi-square for equality of mean scores = 5.784033 P = 0.0555
Chi-square for trend in mean scores = 5.746874 P = 0.0165 *
Chi-square for departures from trend = 0.037159 P = 0.8471
Coefficients of contingency:
Pearson's = 0.362088
Cramer's = 0.274673
Here we see that although the overall test was not significant we did show a
statistically significant trend in mean scores. This suggests that supporting
these mothers did help lessen their burden of grief.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|2 by k| contingency table chi-square test
Several proportions can be compared using a two by k chi-square test. For
example, a village can be subdivided into k age groups and counts made of those
individuals with and those without a particular disease marker. From the
overall test you can see whether or not age has a significant effect on the
disease studied. Arcus also performs a test for linear trend across the k
groups. You can opt to enter your own scores for the trend test. For example,
if a variable was categorised as mild, moderate or severe you would want to
enter your own scores if the data were not presented in order (ref 4). You
could equally use the r by c chi-square test for these analyses; it just
has a different style of presentation and data input. If you need coefficients
of contingency then you should use the r by c chi-square function.
EXAMPLE (from Armitage ref 4 p 373):
The following data describe numbers of children with different sized palatine
tonsils and their carrier status for Strep. pyogenes.
Tonsils
Present but Enlarged Greatly
not enlarged enlarged
Carriers                 19          29          24
Non-carriers            497         560         269
To analyse these data in Arcus you must select 2 by k from the chi-square test
sub-menu of the instant functions menu in the analysis section. Then select
the middle option from the 2 by k chi-square test menu. Choose a 95% confidence
interval by pressing the enter key when prompted by the confidence interval
menu. Then select the number of rows as 3. You then enter the above data as
directed by the screen. Use carriers as successes and non-carriers as failures.
For this example:
Successes Failures Total Per cent
Observed 19 497 516 3.682171
Expected 26.57511 489.4249
Observed 29 560 589 4.923599
Expected 30.33476 558.6652
Observed 24 269 293 8.191126
Expected 15.09013 277.9099
Total 72 1326 1398 5.150215
Total Chi² = 7.884844 P = 0.0194 *
Chi² for linear trend = 7.192746 P = 0.0073 **
Remaining Chi² (non-linearity) = .6920977 P = 0.4055
Here the total chi-square test shows a statistically significant association
between the classifications, i.e. between tonsil size and Strep. pyogenes
carrier status. We have also shown a significant linear trend which enables
us to refine our conclusions to a suggestion that the proportion of Strep.
pyogenes carriers increases with tonsil size.
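The chi-square for linear trend can be checked with a few lines of Python.
This is a sketch of the standard trend formula with scores 1, 2, 3 (cf.
ref 4), not the Arcus source:

counts = [(19, 516), (29, 589), (24, 293)]    # (carriers, total) per column
scores = [1, 2, 3]
R = sum(r for r, _ in counts)                 # total carriers
N = sum(n for _, n in counts)                 # grand total
sx  = sum(n * x for (_, n), x in zip(counts, scores))
sxx = sum(n * x * x for (_, n), x in zip(counts, scores))
srx = sum(r * x for (r, _), x in zip(counts, scores))
pbar = R / N
u = srx - R * sx / N                          # observed minus expected scores
v = pbar * (1 - pbar) * (sxx - sx * sx / N)   # variance of that difference
print(u * u / v)                              # about 7.1927, cf. above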
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Matched pairs (McNemar, Liddell)|
Paired proportions have traditionally been compared using McNemar's test but an
exact alternative is now available (after Liddell 1983). Arcus gives you both.
You enter your data in the 2 by 2 format with discordant cells at top right and
bottom left. The exact test gives you a two tailed probability and exact
confidence limits for the odds ratio. You should use the exact test for your
analysis; McNemar's test is included for interest only.
If you need the exact confidence interval for the difference between the pair
of proportions then please use the "paired proportions" function of the
proportions menu from the instant functions menu of the analysis section.
EXAMPLE (from Armitage ref 4 p 122):
The data below represent a comparison of two media for culturing Mycobacterium
tuberculosis. Fifty suspect sputum specimens were plated up on both media
and the following results were obtained:
Medium B
Growth No Growth
Medium A: Growth 20 12
No Growth 2 16
To analyse these data in Arcus you must select the matched pairs (McNemar,
Liddell) option from the chi-square menu of the instant functions menu in the
analysis section. Select a 95% confidence interval by pressing the enter key
when prompted by the confidence interval menu. Enter the frequencies into the
contingency table on screen as shown above.
For this example:
McNemar's test:
Yates' continuity corrected Chi² = 5.785714 P = 0.0162 *
After Liddell (1983):
Point estimate of relative risk (R') = 6
Exact 95% confidence interval = 1.335772 to 55.07571
F = 4 P (two tailed) = 0.0129 *
R' is significantly different from unity
Here we can conclude that the tubercle bacilli in the experiment grew
significantly better on medium A than on medium B. With 95% confidence we
can state that the chances of a positive culture are between 1.34 and 55.08
times greater on medium A than on medium B.
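The arithmetic here is easy to verify. Below is a minimal Python sketch (an
illustration, not Arcus code) of the Yates-corrected McNemar chi-square,
Liddell's point estimate R' = b/c, and the exact two tailed binomial p based
on the discordant pairs:

from math import comb

b, c = 12, 2                   # discordant cells (top right, bottom left)
chi2 = (abs(b - c) - 1) ** 2 / (b + c)    # McNemar with Yates' correction
rr = b / c                                # Liddell's point estimate R'
n = b + c
p2 = 2 * sum(comb(n, k) for k in range(max(b, c), n + 1)) / 2 ** n
print(chi2, rr, p2)            # 5.785714..., 6.0, 0.0129...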
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Mantel-Haenszel| test for a 2 by 2 series
In case-control studies observed frequencies can often be represented by a
series of two by two tables. Each stratum of this series represents
observations taken at different times, different places or another system of
sub-grouping within one large study. The estimation of relative risk can
utilise the method of Mantel and Haenszel or that of Woolf. The Mantel-Haenszel
method is more robust when some of the strata contain small frequencies. Data
for these tests are entered as a series of two by two tables, each table
corresponding to a stratum of your investigation. Each table has the standard
(++), (+-), (-+), (--) format with (-+) and (--) for controls.
The Mantel-Haenszel pooled estimate of the odds ratio is given with test based
approximate confidence limits calculated by the method of Miettinen (ref 4).
The chi-square test statistic is given with associated probability of the odds
ratio being unity.
EXAMPLE (from Armitage ref 4 p 463):
The following data compare the smoking status of lung cancer patients with
controls. Ten different studies are combined in an attempt to improve the
overall estimate of relative risk. The matching of controls has been ignored
because there was not enough information about matching from each study to be
sure that the matching was the same in each study.
Lung cancer Controls
smoker non-smoker smoker non-smoker
83 3 72 14
90 3 227 43
129 7 81 19
412 32 299 131
1350 7 1296 61
60 3 106 27
459 18 534 81
499 19 462 56
451 39 1729 636
260 5 259 28
To analyse these data in Arcus you must select the Mantel-Haenszel function
from the chi-square sub-menu of the instant functions menu in the analysis
section. Select a 95% confidence interval by pressing the enter key when
prompted by the confidence interval menu. Enter the number of tables as 10.
Then enter each row of the table above as a separate 2 by 2 contingency table:
i.e. The first row is entered as:
Smkr Non
╔══════╤══════╗
Lung cancer ║ 83 │ 3 ║
╟──────┼──────╢
control ║ 72 │ 14 ║
╚══════╧══════╝
... this is then repeated for each of the ten rows.
For this example:
Mantel-Haenszel Chi-square = 292.3788 P < 0.0001 ***
Mantel-Haenszel pooled estimate of odds ratio = 4.681639
Approximate 95% CI = 3.922422 to 5.587809
Here we can say with 95% confidence that the true population odds in favour of
being a smoker were between 3.9 and 5.6 times greater in patients who had lung
cancer compared with controls. This estimate of the relative risk is obviously
highly significantly different from unity.
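A compact Python sketch (not the Arcus source) of the Mantel-Haenszel pooled
odds ratio and chi-square for the ten tables above; the chi-square here is
uncorrected, and Miettinen's test based limits are not attempted:

tables = [(83, 3, 72, 14), (90, 3, 227, 43), (129, 7, 81, 19),
          (412, 32, 299, 131), (1350, 7, 1296, 61), (60, 3, 106, 27),
          (459, 18, 534, 81), (499, 19, 462, 56), (451, 39, 1729, 636),
          (260, 5, 259, 28)]
num = den = sa = se = sv = 0.0
for a, b, c, d in tables:
    n = a + b + c + d
    num += a * d / n                     # pooled odds ratio numerator
    den += b * c / n                     # pooled odds ratio denominator
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    sa += a                              # observed count in the first cell
    se += r1 * c1 / n                    # its expectation
    sv += r1 * r2 * c1 * c2 / (n * n * (n - 1))   # hypergeometric variance
print(num / den)                         # about 4.6816, cf. above
print((sa - se) ** 2 / sv)               # chi-square, about 292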
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Woolf| statistics for 2 by 2 tables & series
In case-control studies observed frequencies can often be represented by a
series of two by two tables. Each stratum of this series represents
observations taken at different times, different places or another system of
sub-grouping within one large study. The estimation of relative risk can
utilise the method of Mantel and Haenszel or that of Woolf. The ¬Mantel-Haenszel╪235832 ¬
method is more robust when some of the strata contain small frequencies. Data
for these tests are entered as a series of two by two tables, each table
corresponding to a stratum of your investigation. Each table has the standard
(++), (+-), (-+), (--) format with (-+) and (--) for controls.
With the Woolf method results for an individual 2 by 2 table are displayed
after you have entered that table; please remember this when entering a
large series.
When all tables have been entered the combined statistics (¬Haldane╪223546 ¬ corrected),
including chi-square for heterogeneity, are given.
EXAMPLE (from Armitage ref 4 p 463):
The following data compare the smoking status of lung cancer patients with
controls. Ten different studies are combined in an attempt to improve the
overall estimate of relative risk. The matching of controls has been ignored
because there was not enough information about matching from each study to be
sure that the matching was the same in each study.
Lung cancer Controls
smoker non-smoker smoker non-smoker
83 3 72 14
90 3 227 43
129 7 81 19
412 32 299 131
1350 7 1296 61
60 3 106 27
459 18 534 81
499 19 462 56
451 39 1729 636
260 5 259 28
To analyse these data in Arcus you must select the Woolf function from the
chi-square sub-menu of the instant functions menu in the analysis section.
Select a 95% confidence interval by pressing the enter key when prompted by
the confidence interval menu. Enter the number of tables as 10. Then enter
each row of the table above as a separate 2 by 2 contingency table:
i.e. The first row is entered as:
Smkr Non
╔══════╤══════╗
Lung cancer ║ 83 │ 3 ║
╟──────┼──────╢
control ║ 72 │ 14 ║
╚══════╧══════╝
... this is then repeated for each of the ten rows.
For this example:
Statistics from combined values with Haldane correction:
Odds ratio = 4.510211
Approximate 95% CI = 3.733489 to 5.448524
Chi² for E(LOR) = 0 is 254.0865 P < 0.0001 ***
Chi² for Heterogeneity = 6.532662 P = 0.6856
Here we can say that there was no convincing evidence of heterogeneity between
the separate estimates of relative risk from each of the different studies.
The pooled estimate suggested with 95% confidence that the true population
odds for being a smoker were between 3.7 and 5.4 times greater in lung cancer
patients compared with controls. The result using the Mantel-Haenszel method
gave 3.9 to 5.6; the difference is partly accounted for by the Haldane
correction. I would, however, advise you to keep to the Mantel-Haenszel method
for general use, as it is more robust. I have included Woolf's method for
those who want to go further with the inter-table statistics.
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Exact Tests|
¬<Fisher's exact test>╪243294 ¬
¬<Matched pairs (McNemar, Liddell)>╪233702 ¬
¬<Exact confidence limits for 2 by 2 odds>╪247832 ¬
¬<Sign test>╪249896 ¬
Various exact treatments of two by two tables are given in this section.
Permutational probabilities and exact confidence limits are provided.
|Fisher's Exact Test|
This exact treatment of the fourfold table should be used instead of the
chi-square test when any of the expected frequencies are less than five. In
practical terms, however, there is little point in using the fourfold
chi-square test when Arcus provides you with a Fisher's exact test which can
cope with reasonably large numbers. Arcus uses the definition of a two tailed
p value described by N. T. J. Bailey (ref 27). Finney recommends doubling the
one tailed value and controversy remains. Arcus calculates the conventional
exact test until the numbers are so large that the intermediate steps would
cause an overflow error; at this point the hypergeometric distribution is
utilised. The data entry is identical to the procedure for the chi-square
2 by 2 table and, indeed, results for a chi-square test are given with
Fisher's exact test results. The rearranged table is displayed with the
expectation of the first cell. The chi-square test results are included for
educational purposes only; you should make your inferences from the Fisher's
p values.
EXAMPLE (from Armitage ref 4 p 130):
The following data compare malocclusion of teeth with method of feeding infants.
Normal teeth Malocclusion
Breast fed 4 16
Bottle fed 1 21
To analyse these data in Arcus you must select the Fisher's exact test function
from the exact tests sub-menu of the instant functions menu in the analysis
section. Enter the frequencies into the contingency table on screen as shown
above.
For this example:
Rearranged table:
╔════════════════╤════════════════╤════════════════╗
║ 4 │ 1 │ 5 ║
╟────────────────┼────────────────┼────────────────╢
║ 16 │ 21 │ 37 ║
╠════════════════╪════════════════╪════════════════╣
║ 20 │ 22 │ 42 ║
╚════════════════╧════════════════╧════════════════╝
Expectation of A = 2.380952
1-tailed probability (Upper tail) = 0.143527 (Doubled = 0.287054)
2-tailed probability (by summation) = 0.174484
Here we have to accept the null hypothesis that there is no association between
these two classifications, i.e. between feeding method and malocclusion.
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Expanded Fisher-Irwin test|
This allows you to see a conventional Fisher's exact test in more detail.
The complete conditional distribution for the observed marginal totals
is displayed. Arcus utilises double precision floating point arithmetic
for the exact tests (ref 27).
EXAMPLE (from Armitage ref 4 p 130):
The following data compare malocclusion of teeth with type of feeding received
by infants.
Normal teeth Malocclusion
Breast fed 4 16
Bottle fed 1 21
To analyse these data in Arcus you must select the Fisher's exact test function
from the exact tests sub-menu of the instant functions menu in the analysis
section. Enter the frequencies into the contingency table on screen as shown
above.
For this example:
Rearranged table:
╔════════════════╤════════════════╤════════════════╗
║ 4 │ 1 │ 5 ║
╟────────────────┼────────────────┼────────────────╢
║ 16 │ 21 │ 37 ║
╠════════════════╪════════════════╪════════════════╣
║ 20 │ 22 │ 42 ║
╚════════════════╧════════════════╧════════════════╝
Expectation of A = 2.380952
A Lower Tail Individual P Upper Tail
0 0.030956848030019 0.030956848030019 1.000000000000000
1 0.202939337085679 0.171982489055660 0.969043151969981
2 0.546904315196998 0.343964978111320 0.797060662914321
3 0.856472795497186 0.309568480300188 0.453095684803002
4 0.981774323237738 0.125301527740552 0.143527204502814
5 1.000000000000000 0.018225676762262 0.018225676762262
1-sided probability (Upper tail) = 0.1435272045 (Doubled = 0.2870544090)
2-sided probability (by summation)= 0.1744840525
Here we have to accept the null hypothesis that there is no association between
these two classifications, i.e. between feeding mode and malocclusion.
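The whole conditional distribution above can be reproduced directly from
hypergeometric point probabilities. A minimal Python sketch (not Arcus code;
Python's arbitrary precision integers sidestep the overflow problem mentioned
earlier):

from math import comb

a, b, c, d = 4, 1, 16, 21                  # the rearranged table above
r1, r2, c1 = a + b, c + d, a + c
n = a + b + c + d
def prob(k):                               # P(first cell = k), hypergeometric
    return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)
ps = [prob(k) for k in range(min(r1, c1) + 1)]
upper = sum(ps[a:])                        # one tailed (upper) probability
two = sum(p for p in ps if p <= ps[a] + 1e-12)   # two tailed by summation
print(upper, two)                          # 0.143527..., 0.174484...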
¬<p values>╪29175 ¬
¬<reference list>╪310584 ¬
|Exact Confidence Limits for 2 by 2 Odds|
Gart's method is used here to construct exact confidence limits for the odds
ratio of a fourfold table (ref A15). The default selections are 95, 99 and 90
per cent two tailed values but you may enter individual tail areas. Thus, for
a one tailed 95% confidence limit you would enter a lower tail area of 0 and
an upper tail area of 5. These exact confidence limits complement Fisher's
exact test of independence in a fourfold table. Please note that this
iterative calculation will take a long time with large numbers.
EXAMPLE (from Thomas ref A15):
The following data look at the criminal convictions of twins in an attempt to
investigate the heritability of criminality.
Convicted Not-Convicted
Dizygotic 2 15
Monozygotic 10 3
To analyse these data in Arcus you must select exact confidence limits for
2 by 2 odds from the exact tests sub-menu. To select a 95% two tailed
confidence interval just press enter when you are presented with the confidence
interval menu.
For this example:
Rearranged table:
╔════════════════╤════════════════╤════════════════╗
║ 15 │ 2 │ 17 ║
╟────────────────┼────────────────┼────────────────╢
║ 3 │ 10 │ 13 ║
╠════════════════╪════════════════╪════════════════╣
║ 18 │ 12 │ 30 ║
╚════════════════╧════════════════╧════════════════╝
Fisher-Irwin p (1 sided) = 0.000465 Doubled = 0.00093
Confidence limits with 2.5% lower tail area and 2.5% upper tail area
{two tailed}
Observed odds ratio = 25
Confidence limits = 301.4666 and 2.753266
Reciprocal = 0.04
Confidence limits = 0.003317 and 0.363205
Here we can say with 95% confidence that the odds of being a criminal convict
are between 2.75 and 301.5 times greater for identical than for non-identical
twins.
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Sign test|
In a sample of size n, if r individuals show a change in one particular
direction then the sign test can be used to assess the significance of this
change. Arcus gives you one and two sided cumulative probabilities from a
binomial distribution with a projected proportion of 0.5 for the null
hypothesis. An appropriate normal approximation is used with large numbers.
You are also given an exact confidence interval for the proportion r/n
(ref 5,6). If you need a test where the projected proportion for the null
hypothesis is not 0.5 then you should use the ¬single proportion╪263180 ¬ function
listed in the proportions sub-menu of the Arcus instant functions menu.
EXAMPLE (from Altman ref 5 p 186):
Out of a group of 11 women investigated 9 were found to have a food energy
intake below the daily average and 2 above. We want to quantify the impact
of 9 out of 11, i.e. how much evidence have we got that these women are
different from the norm?
To analyse these data in Arcus you must select the sign test from the instant
functions menu of the analysis section. To select a 95% two tailed confidence
interval just press enter when you are presented with the confidence interval
menu.
For this example:
For 11 pairs with 9 on one side.
Cumulative probability (2-sided) = 0.06543
(1-sided) = 0.032715 *
Exact 95% Confidence limits for the Proportion:
Lower Limit = 0.482248
Proportion = 0.818182
Upper Limit = 0.977122
If we were confident that this group could only realistically be expected to
have a lower caloric intake then we could make inference from the one tailed
p value. We do not, however, have this evidence, so we must use the two tailed
p value and retain the null hypothesis that the true proportion is 0.5. We
can say with 95% confidence that the true population value of the proportion
lies somewhere between 0.48 and 0.98. The most sensible response to these
results would be to go back and collect more data.
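The binomial arithmetic behind this result is shown below in a short Python
sketch (an illustration, not Arcus code); with a null proportion of 0.5 the
distribution is symmetric, so the two tailed value is twice the one tailed:

from math import comb

n, r = 11, 9
k = max(r, n - r)
one = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n   # one tailed area
print(one, 2 * one)    # 0.032715, 0.06543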
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Randomisation| Functions
This section employs a well tried and widely accepted random number generator
to randomise series of numbers for given allocation designs. The results can
be used in the design of randomised studies. Please note that the random
number generator is reseeded each time it is used and you have virtually no
chance of using the same (pseudo)random number series for different
randomisations. For more information on the random number generator used here
please see "¬random numbers╪254271 ¬".
a) You can randomise a series of integers for which you define the
beginning and end points of the series. For example, randomising
numbers from 6 to 10 might give 8 6 9 10 7; this is like shuffling
5 cards labelled 6 to 10.
b) Random allocation of cases and controls for paired case-control
studies. For example, you might want to randomise 50 patients into
treatment (case) and placebo (control) groups for a pilot study of a
new drug. This would give 50 pairs of CASE - CONTROL or CONTROL - CASE.
If this was a randomised crossover study then you would give drug first
if the order was CASE - CONTROL and you would give placebo first if the
order was CONTROL - CASE.
c) Random allocation of subjects to case or control groups for unpaired
case-control studies. For example, you might want to look at the effect
of a new treatment. For a randomised controlled trial you might
randomly allocate some patients for this new treatment and compare them
with similar patients who did not receive this treatment. For 24
patients in two groups of 12 you would enter 24 into this section of
Arcus randomisation. This would give you two groups of 12 e.g.:
CASES CONTROLS
2 1
5 3
6 4
7 8
9 11
10 12
13 14
15 16
19 17
20 18
21 22
24 23
Here the first patient would be allocated to the control group and
the second to the treatment group etc.
|Random Numbers|
There is much fear of computer generated random numbers because of some bad
random number generators which have cropped up over the years. This is not a
problem in Arcus Pro-Stat because it uses well tried and tested methods.
If you want to get down to basics you might ask: what is random? A lecture
theatre filled with Mathematicians, Philosophers and Physicists
would love to debate this; enough said. What we can do is look for evidence
of non-randomness such as repeated patterns. Various methods have been
employed to look for non-randomness from "random" number generators since they
began to emerge around 35 years ago. Several "quick and dirty" random number
generators have become widely used because they are supplied with computer
language compilers. These generators often use over simple methods which
produce sequences of numbers with repeating patterns. This is unacceptable for
statistical use.
Arcus Pro-Stat uses the widely accepted Park & Miller "minimal" method extended
with a Bays-Durham shuffle. This is well described by Press et al. (ref 33).
Most random number generators require a seed. If the generator is given the
same seed each time it is called then it will produce the same series of
numbers. This is not acceptable for many purposes, therefore Arcus seeds the
random number generator with a number taken from your computer's
clock. This number is the number of hundredths of a second which have elapsed
since midnight. You will therefore understand why it is very difficult to
recall the same "random" sequence from Arcus when you ask Arcus to seed the
generator for you. You can also choose to enter your own seed.
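For the curious, the Park & Miller "minimal standard" generator is itself
only a few lines. The Python sketch below shows the core recurrence
seed = 16807 * seed mod (2^31 - 1); the Bays-Durham shuffle described by
Press et al. (ref 33) is omitted here for brevity:

def park_miller(seed):
    # Park & Miller minimal standard generator (no Bays-Durham shuffle)
    m = 2 ** 31 - 1                    # 2147483647, a Mersenne prime
    while True:
        seed = (16807 * seed) % m
        yield seed / m                 # uniform deviate in (0, 1)

gen = park_miller(12345)               # seed, e.g. hundredths since midnight
print([next(gen) for _ in range(3)])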
|Sample Size| Estimations
¬<for paired t test>╪257921 ¬
¬<for unpaired t test>╪258472 ¬
¬<for independent case-control>╪259089 ¬
¬<for matched case-control>╪260701 ¬
¬<for independent prospective>╪259879 ¬
¬<for paired prospective>╪261568 ¬
¬<for population surveys>╪262306 ¬
At the design stage of an investigation one must try to minimise the probability
of failing to detect a real effect, i.e. type II error (false negative).
Minimum sample sizes necessary to avoid given levels of type II error are
calculated by Arcus for population surveys, for the comparison of proportions
and for the comparison of means.
Type II error is indicated in reverse by the power of a study, thus power is the
probability of detecting a true effect. You are asked to select a power level
for your study along with the two tailed significance level which you intend to
use in subsequent analysis. The latter considers type I error, the probability
of incorrectly rejecting the null hypothesis (false positive).
Minimum sample sizes are estimated for the comparison of means using Student t
tests, the comparison of proportions and for population surveys. Provision is
made for paired and unpaired designs in case-control studies or independent
group studies. All of these calculations require you to enter a value for power
(the probability of detecting a true effect) and alpha (the probability of
detecting a false effect); all calculations consider two tailed investigation
(ref 4, 8, 11, 30, 31). Other information required depends upon the type of
study being planned; each required parameter is described in the help screen of
the relevant menu selection. I must emphasise the point that good design lies
at the heart of good research and for important studies statistical advice
should be sought at the planning stage!
¬<reference list>╪310584 ¬
Sample Size |for Paired t Test|
This function gives you the minimum number of pairs of subjects needed to detect
a true difference DELTA in population means with power POWER and two sided
type I error probability ALPHA (ref 30, 31).
INFORMATION REQUIRED:
POWER - Probability of detecting a true effect.
ALPHA - Probability of detecting a false effect (two sided).
DELTA - Difference in population means.
SD - Estimated standard deviation of paired response differences.
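As an illustration, the usual normal-approximation formula for the number of
pairs is sketched below in Python; Arcus may refine such calculations (e.g.
with the t distribution), so treat this only as an approximate check:

import math
from statistics import NormalDist

def pairs_needed(power, alpha, delta, sd):
    # approximate pairs for a paired t test via the normal approximation
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2
    return math.ceil(n)

print(pairs_needed(power=0.9, alpha=0.05, delta=0.5, sd=1.0))   # about 43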
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Unpaired t Test|
This function gives you the minimum number of experimental subjects needed to
detect a true difference DELTA in population means with power POWER and two
sided type I error probability ALPHA (ref 30, 31).
INFORMATION REQUIRED:
POWER - Probability of detecting a true effect.
ALPHA - Probability of detecting a false effect (two sided).
DELTA - Difference in population means.
SD - Estimated standard deviation for within group differences.
M - Number of control subjects per experimental subject.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Independent Case-Control| studies
This function gives the minimum number of case subjects required to detect a
real odds ratio or case exposure rate with power POWER and two sided type I
error probability ALPHA. This sample size is also given as a continuity
corrected value intended for use with corrected chi-square and Fisher's exact
tests (ref 10, 30).
POWER - Probability of detecting a real effect.
ALPHA - Probability of detecting a false effect (two sided).
P0 - Probability of exposure in controls.
(P1 - Probability of exposure in case subjects.) *Input P1 or OR.
(OR - Odds ratio of exposures between cases and controls.)
M - Number of control subjects per case subject.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Independent Prospective| studies
This function gives the minimum number of case subjects required to detect a
true relative risk or experimental event rate with power POWER and two sided
type I error probability ALPHA. This sample size also given as a continuity
corrected value intended for use with corrected chi-square and Fisher's exact
tests (ref 8, 10, 30).
POWER - Probability of detecting a real effect.
ALPHA - Probability of detecting a false effect (two sided).
P0 - Probability of event in controls.
(P1 - Probability of event in experimental subjects) *Input P1 or RR.
(RR - Relative risk of events between experimental subjects and controls.)
M - Number of control subjects per experimental subject.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Matched Case-Control| studies
This function gives you the minimum sample size necessary to detect a true
odds ratio OR with power POWER and a two sided type I error probability ALPHA.
If you are using more than one control per case then this function also provides
the reduction in sample size relative to a paired study that you can obtain
using your number of controls per case (ref 10, 30).
INFORMATION REQUIRED:
POWER - Probability of detecting a real effect.
ALPHA - Probability of detecting a false effect (two sided).
R - Correlation coefficient for exposure between matched
cases and controls.
P0 - Probability of exposure in the control group.
M     - Number of control subjects matched to each case subject.
OR - Odds ratio.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Paired Prospective| studies
This function gives you the minimum number of subject pairs that you require
to detect a true relative risk RR with power POWER and two sided type I error
probability ALPHA (ref 10, 30).
INFORMATION REQUIRED:
POWER - Probability of detecting a real effect.
ALPHA - Probability of detecting a false effect (two sided).
R - Correlation coefficient for failure between paired subjects.
***Next input is either P0 and RR or P0 and P1 (when RR=P1/P0).***
P0 - Event rate in the control group.
*(P1 - Event rate in experimental group.)
*(RR - Risk of failure of experimental subjects relative to controls.)
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
Sample Size |for Population Surveys|
This function gives you the minimum number of subjects that you require for a
survey of a population to estimate, to within a stated difference, the
proportion of individuals in that population displaying a particular factor
(ref 10).
INFORMATION REQUIRED:
Confidence level (i.e. 1-ALPHA)
(ALPHA - Probability of detecting a false effect (two sided).)
Population size
Proportion (as %) of the population displaying a particular factor.
A difference (as %) in that proportion you want to be able to detect.
¬<reference list>╪310584 ¬
¬<sample size>╪256010 ¬
|Proportions|
¬<Single proportion>╪263180 ¬
¬<Paired proportions>╪266823 ¬
¬<Unpaired proportions>╪265024 ¬
This section constructs confidence limits and probabilities for various
presentations of proportions. Exact tests are employed wherever possible.
|Single Proportion|
This function gives you the exact and approximate confidence interval for a
single proportion. There is also an hypothesis test for the proportion in
comparison with the expected proportion under the null hypothesis. You enter
this expected proportion when prompted for the probability of success on each
trial. This test uses the relevant binomial distribution. For example, when
comparing two preparations of a drug, if 65 out of 100 patients preferred
preparation A then the significance of this majority could be expressed by the
hypothesis test and described by the confidence interval (ref 4, 11).
EXAMPLE (from Armitage ref 4 p 116):
In a trial of two analgesics, X and Y, 100 patients tried each drug for a week.
The trial order was randomised. 65 out of 100 preferred drug Y.
To analyse these data in Arcus you must select single proportion from the
proportions sub-menu of the instant functions menu in the analysis section.
To select a 95% confidence interval just press enter when you are presented
with the confidence interval menu. Enter n as 100 and r as 65. Enter the
binomial test proportion as 0.5; this is because you would expect 50% of an
infinite number of patients to prefer drug Y if there was no difference between
X and Y.
For this example:
Proportion = 0.65
Exact 95% Confidence Limits:
Lower Limit = 0.548151
Upper Limit = 0.742706
Using null hypothesis that the population proportion equals 0.5:
Binomial two tailed P = 0.0035 **
Here we can conclude that the proportion was statistically significantly
different from 0.5. With 95% confidence we can state that the true population
value for the proportion lies somewhere between 0.55 and 0.74.
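If you wish to verify these figures outside Arcus then the following Python
sketch (assuming the SciPy library, which is not part of Arcus) reproduces the
exact interval and the binomial test:
  # A sketch reproducing the worked example with an exact binomial
  # test and Clopper-Pearson interval.
  from scipy.stats import binomtest

  res = binomtest(k=65, n=100, p=0.5)           # 65 of 100 preferred drug Y
  ci = res.proportion_ci(confidence_level=0.95, method="exact")
  print(65 / 100)                               # proportion = 0.65
  print(ci.low, ci.high)                        # approx. 0.5482 to 0.7427
  print(res.pvalue)                             # approx. 0.0035, two tailed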
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Unpaired Proportions|
Two independent proportions may be compared using this function. It is assumed
that your data have been observed from random samples of the two independent
populations. For example, the proportion of patients surviving a particular
surgical emergency could be compared for surgical and non-surgical management
protocols. An hypothesis test for the equality of these proportions is given
along with a confidence interval for the difference between the proportions.
A normal approximation is used for both of these methods, so you should avoid
small numbers (ref 4).
EXAMPLE (from Armitage ref 4 p 124):
Two methods of treatment, A and B, for a particular disease were investigated.
Out of 257 patients treated with method A 41 died and out of 244 patients
treated with method B 64 died. We want to compare these fatality rates.
To analyse these data in Arcus you must select unpaired proportions from the
proportions sub-menu of the instant functions menu in the analysis section.
To select a 95% confidence interval just press enter when you are presented
with the confidence interval menu. Enter n1 as 257, r1 as 41, n2 as 244 and
r2 as 64.
For this example:
Proportion 1 = 0.159533
Proportion 2 = 0.262295
95% confidence interval for the difference = -0.173829 to -0.031695
Normal deviate (Z) = -2.824689
Two tailed P = 0.0047 **
One tailed P = 0.0024 **
Here we can conclude that the difference between these two proportions is
statistically significantly different from zero. With 95% confidence we can
state that the true population fatality rate with treatment B is between 0.03
and 0.17 greater than with treatment A.
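The calculation behind this output can be checked with a short Python sketch;
this illustrates the usual normal approximation (pooled variance for the test,
unpooled for the interval) and is not the Arcus code itself:
  # A sketch of the normal approximation used in the worked example.
  from math import sqrt

  n1, r1, n2, r2 = 257, 41, 244, 64
  p1, p2 = r1 / n1, r2 / n2
  diff = p1 - p2                                # approx. -0.1028

  p = (r1 + r2) / (n1 + n2)                     # pooled proportion
  z = diff / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
  print(z)                                      # approx. -2.8247

  se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  print(diff - 1.96 * se, diff + 1.96 * se)     # approx. -0.1738 to -0.0317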
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Paired Proportions|
Two proportions may be paired by sharing a common feature. For example, when
comparing two culture media a sputum sample from one patient is plated onto both
culture media; this is the "pairing". The procedure is then repeated for a
number of patients to allow proportions to be compared. Arcus gives you an
hypothesis test for the equality of these proportions and a confidence interval
for the difference between them. Exact methods are used throughout (ref 4, 20).
The two tailed p value from the hypothesis test equates with the exact test for
a paired fourfold table (Liddell) which has been presented above. With large
numbers an appropriate normal approximation is used in the hypothesis test.
EXAMPLE (from Armitage ref 4 p 122):
The data below represent a comparison of two media for culturing Mycobacterium
tuberculosis. Fifty suspect sputum specimens were plated up on both media
and the following results were obtained:
Medium B
Growth No Growth
Medium A: Growth 20 12
No Growth 2 16 N = 50
To analyse these data in Arcus you must select paired proportions from the
proportions sub-menu of the instant functions menu in the analysis section.
Select a 95% confidence interval by pressing enter when you are presented
with the confidence interval menu. Enter n as 50, ++(k) as 20, +-(r) as 12 and
-+(s) as 2.
For this example:
Proportion 1 = 0.64 (k+r)/n
Proportion 2 = 0.44 (k+s)/n
Proportion difference = 0.2 (r-s)/n
Cumulative probability (2-sided) = 0.012939 *
(1-sided) = 0.00647 **
Exact 95% Confidence Limits for the proportion difference:
Lower Limit = 0.040251
Upper Limit = 0.270014
Here we can conclude that the proportion difference is statistically
significantly different from zero. With 95% confidence we can say that the
true population value for the proportion difference lies somewhere between
0.04 and 0.27. This leaves us with little doubt that medium A is more
effective than medium B for the culture of tubercle bacilli.
Compare these results with the exact test for ¬matched pairs╪233702 ¬. Some find it
easier to discuss this type of result in terms of estimated relative risk.
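The two sided probability can be checked outside Arcus with a short Python
sketch (assuming the SciPy library); the exact confidence limits for the
difference use other methods and are not reproduced here:
  # A sketch of the exact test on the discordant pairs: under the null
  # hypothesis the r = 12 and s = 2 discordant pairs split 50:50.
  from scipy.stats import binomtest

  r, s = 12, 2
  print(binomtest(r, n=r + s, p=0.5).pvalue)    # approx. 0.0129, two sided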
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Miscellaneous| Functions
¬<Relative risk>╪269584 ¬
¬<Diagnostic test 2 by 2 table>╪272252 ¬
¬<Likelihood ratios for 2 by k tables>╪276470 ¬
¬<Number needed to treat>╪279627 ¬
¬<False result probabilities>╪282414 ¬
¬<Standardized mortality ratios>╪285320 ¬
|Relative Risk| in Incidence Studies
In studies of the incidence of a particular outcome in two groups of
individuals, defined by the presence or absence of a particular characteristic,
the appropriate measure of association for the resultant fourfold table is the
relative risk rather than the odds ratio.
Relative risk is used for prospective studies where you follow groups with
different characteristics to observe whether or not a particular outcome
occurs:
Group 1 Group 2
OUTCOME YES A B
NO C D
Relative Risk = [A/(A+C)]/[B/(B+D)]
In retrospective studies, where you select subjects by outcome and not by group
characteristic, you would use the odds ratio ((A/C)/(B/D)) and not the
relative risk. The odds ratio is often appropriate to case-control studies.
Arcus gives confidence intervals for the odds ratio in the 2 by 2 chi-square
test and in the exact confidence interval for 2 by 2 odds which is listed in
the exact tests menu.
This function gives you the relative risk with a confidence interval. The
iterative methods of approximation recommended by Gart and Nam are used in this
function (ref 35). Please note that relative risk, risk ratio and likelihood
ratio are the same calculation.
EXAMPLE (from Altman ref 5 p 267)
The following data represent a prospective investigation of Apgar score in
babies who had been classified as showing either symmetric or asymmetric growth
retardation on the basis of ultrasound investigation.
Symmetric IUGR Asymmetric IUGR
Apgar < 7 2 33
Apgar >=7 14 58
To analyse these data in Arcus you must select relative risk from the
miscellaneous sub-menu of the instant functions menu in the analysis section.
Select a 95% confidence interval by pressing enter when you are presented
with the confidence interval menu. Then enter the above frequencies into the
2 by 2 table on the screen.
For this example:
Risk ratio (relative risk in incidence study) = 0.344697
The 95% CI = 0.094377 to 1.040814
The 90% CI = 0.114327 to 0.902673
N.B. This is more accurate than the logit confidence interval quoted in ref 5.
Here we can say that the risk of a low Apgar score for symmetrically growth
retarded babies is about 35% of that risk for their asymmetrically growth
retarded counterparts. There are, however, rather few observations in the
symmetrical group, which is reflected by the broad 95% confidence interval.
An appropriate response to these "suggestive" results would be to go back and
collect more data.
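If you wish to check the point estimate outside Arcus then the following Python
sketch gives the risk ratio with the simpler log method interval; because Arcus
uses the Gart and Nam iteration (ref 35), its limits above differ slightly:
  # A sketch of the risk ratio with the simpler log method interval;
  # not the Gart and Nam method that Arcus itself uses.
  from math import exp, log, sqrt

  a, b, c, d = 2, 33, 14, 58                    # the fourfold table above
  rr = (a / (a + c)) / (b / (b + d))
  se = sqrt(1 / a - 1 / (a + c) + 1 / b - 1 / (b + d))
  print(rr)                                     # approx. 0.3447
  print(exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se))
                                                # approx. 0.09 to 1.30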
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Diagnostic Test 2 by 2 table|
The quality of a diagnostic test is often expressed in terms of sensitivity and
specificity. Sensitivity is the ability of that test to pick up what you are
looking for and specificity is the ability of the test to reject what you are
not looking for.
DISEASE
Present Absent
TEST + a (true +ve) b (false +ve)
- c (false -ve) d (true -ve)
Sensitivity = a/(a+c)
Specificity = d/(b+d)
Likelihood ratio of a positive test = [a/(a+c)]/[b/(b+d)]
Likelihood ratio of a negative test = [c/(a+c)]/[d/(b+d)]
Likelihood ratios have become useful because they enable one to quantify the
effect a particular test result has on the probability of a certain diagnosis
or outcome. Using a simplified form of Bayes' theorem:
posterior odds = prior odds * likelihood ratio
where odds = probability/(1-probability)
probability = odds/(odds+1)
This Arcus function gives you the prevalence (pre-test likelihood), the
predictive values (post-test likelihoods) with their change, sensitivity,
specificity and
likelihood ratios (ref 12, 36). The confidence intervals for the likelihood
ratios are constructed using the iterative method suggested by Gart and Nam
(ref 35). This function is not truly Bayesian because it does not use any
starting probability. It does, however, provide a generator for likelihood
ratios which can then be used to direct the flow of probability in Bayesian
analysis. For an excellent account of this approach in medical diagnosis I
advise you to read David Sackett's book (ref 12).
EXAMPLE (from Sackett ref 12 p 109):
Initial creatine phosphokinase (CK) levels were related to the subsequent
diagnosis of acute myocardial infarction (MI) in a group of patients with
suspected MI. 80 international units of CK or greater was taken as an arbitrary
positive test result:
MI No MI
CK >= 80 215 16
CK < 80 15 114
To analyse these data in Arcus you must select diagnostic test 2 by 2 table
from the miscellaneous sub-menu of the instant functions menu in the analysis
section. Select a 95% confidence interval by pressing enter when you are
presented with the confidence interval menu. Then enter the above frequencies
into the 2 by 2 table on the screen.
For this example:
Disease / Feature:
present absent totals
Test: ╔══════════════════╤══════════════════╤══════════════════╗
Positive║ 215 │ 16 │ 231 ║
║ A│B │ ║
╟──────────────────┼──────────────────┼──────────────────╢
Negative║ 15 C│D 114 │ 129 ║
║ │ │ ║
╟──────────────────┼──────────────────┼──────────────────╢
Totals║ 230 │ 130 │ 360 ║
╚══════════════════╧══════════════════╧══════════════════╝
Prevalence (pre-test likelihood of disease) = 0.638889 = 64%
Predictive value of +ve test
(post-test likelihood of disease) = 0.930736 = 93% {change = 29%}
Predictive value of -ve test
(post-test likelihood of no disease) = 0.116279 = 12% {change = -52%}
Sensitivity (true positive rate) = 0.934783 = 93%
Specificity (true negative rate) = 0.876923 = 88%
Likelihood ratios with 95% confidence intervals:
LR (positive test) = 7.595109 (4.897431 to 12.12324)
LR (negative test) = 0.074371 (0.045345 to 0.120077)
Here we can say with 95% confidence that CK results of >=80 are at least 4.9
times more likely to come from patients who have had an MI than they are to
come from those who have not had an MI. Also with 95% confidence we can say
that CK results of <80 are at most only about one eighth (0.12) as likely to
come from patients who have had an MI as they are to come from those who have
not had an MI.
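The summary statistics in this output can be checked with a short Python sketch
(point estimates only; the Gart and Nam intervals are not reproduced here). It
also demonstrates the Bayes' theorem relation quoted above:
  # A sketch of the point estimates in the example output.
  a, b, c, d = 215, 16, 15, 114

  prevalence = (a + c) / (a + b + c + d)        # approx. 0.6389
  sens = a / (a + c)                            # approx. 0.9348
  spec = d / (b + d)                            # approx. 0.8769
  lr_pos = sens / (1 - spec)                    # approx. 7.5951
  lr_neg = (1 - sens) / spec                    # approx. 0.0744

  # Bayes: posterior odds = prior odds * likelihood ratio; the result
  # equals the predictive value of a positive test, a/(a+b).
  prior_odds = prevalence / (1 - prevalence)
  post_odds = prior_odds * lr_pos
  print(post_odds / (1 + post_odds))            # approx. 0.9307 = 215/231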
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Likelihood ratios for 2 by k tables|
The quality of a diagnostic test is often expressed in terms of sensitivity and
specificity. Sensitivity is the ability of that test to pick up what you are
looking for and specificity is the ability of the test to reject what you are
not looking for.
DISEASE
Present Absent
TEST + a (true +ve) b (false +ve)
- c (false -ve) d (true -ve)
Sensitivity = a/(a+c)
Specificity = d/(b+d)
Likelihood ratio of a positive test = [a/(a+c)]/[b/(b+d)]
Likelihood ratio of a negative test = [c/(a+c)]/[d/(b+d)]
Likelihood ratios have become useful because they enable one to quantify the
effect a particular test result has on the probability of a certain diagnosis
or outcome. Using a simplified form of Bayes' theorem:
posterior odds = prior odds * likelihood ratio
where odds = probability/(1-probability)
probability = odds/(odds+1)
We can generalise these methods to situations of more than two test outcomes.
In this situation we have a two by k design where k is the number of test
outcomes studied. If one test outcome is called test level j then the
likelihood ratio at level j is given by:
likelihood ratio j = p(tj given disease)/p(tj given no disease)
where p(tj given ...) is the proportion displaying the relevant test result at
level j in the stated group.
This Arcus function gives you likelihood ratios and their confidence intervals
for each level of test result (ref 12, 36). The confidence intervals for the
likelihood ratios are constructed using the iterative method suggested by Gart
and Nam (ref 35).
EXAMPLE (from Sackett ref 12 p 111):
Initial creatine phosphokinase (CK) levels were related to the subsequent
diagnosis of acute myocardial infarction (MI) in a group of patients with
suspected MI. Four ranges of CK result were chosen for the study:
MI No MI
CK >= 280 97 1
CK = 80-279 118 15
CK = 40-79 13 26
CK = 1-39 2 88
To analyse these data in Arcus you must select likelihood ratios for 2 by k
tables from the miscellaneous sub-menu of the instant functions menu in the
analysis section. Select a 95% confidence interval by pressing enter when you
are presented with the confidence interval menu. Enter the number of test
levels as 4 then enter the above frequencies as prompted on the screen.
For this example:
RESULT + FEATURE - FEATURE Likelihood ratio with 95% CI
1 97 1 54.82609 (9.923024 to 311.5679)
2 118 15 4.446377 (2.772549 to 7.315978)
3 13 26 0.282609 (0.151798 to 0.524821)
4 2 88 0.012846 (0.003513 to 0.046229)
Here we can say with 95% confidence that CK results of >=280 are at least ten
(9.9) times more likely to come from patients who have had an MI than they are
to come from those who have not had an MI.
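The level specific ratios can be checked with a short Python sketch (point
estimates only; the confidence intervals require the Gart and Nam iteration and
are not reproduced here):
  # A sketch of the level specific likelihood ratios for the CK example.
  mi = [97, 118, 13, 2]                         # counts with MI
  no_mi = [1, 15, 26, 88]                       # counts without MI
  n_mi, n_no = sum(mi), sum(no_mi)              # 230 and 130

  for level, (x, y) in enumerate(zip(mi, no_mi), start=1):
      lr = (x / n_mi) / (y / n_no)              # p(level given MI) over
      print(level, round(lr, 6))                # p(level given no MI)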
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|Number needed to treat|
The object of treating patients is to prevent adverse outcomes. If we look at
one treatment or intervention in isolation then we can study its effect on the
outcome or the adverse effect in question. Laupacis et al. quote the large
Veterans Administration Trial where anti-hypertensives were investigated over
three years for their effect on target organ damage rates (ref 37). Let us
look at the definitions of some outcome statistics:
Treated Placebo
ADVERSE EVENT YES A B
NO C D
LET: Pc = proportion of subjects in control group who suffer an event
Pt = proportion of subjects in treated group who suffer an event
Pc = B / (B + D)
Pt = A / (A + C)
THEN: Relative risk reduction = (Pc - Pt) / Pc = RRR
      Absolute risk reduction = Pc - Pt = ARR = RRR * Pc
Number needed to treat = 1 / (Pc - Pt) = 1 / ARR
Arcus gives you relative risk, relative risk reduction, absolute risk reduction
and the number needed to treat. Confidence intervals for each of these
statistics are calculated using the iterative approaches advocated by Gart and
Nam (ref 35, 38).
EXAMPLE (from Haynes & Sackett ref 38):
In a trial of a drug for the treatment of severe congestive heart failure 607
patients were treated with a new angiotensin converting enzyme inhibitor (ACEi)
and 607 other patients were treated with a standard non-ACEi régime. 123 out
of 607 patients on the non-ACEi régime died within six months and 94 out of the
607 ACEi treated patients died within six months.
To analyse these data in Arcus you must select number needed to treat from the
miscellaneous sub-menu of the instant functions menu in the analysis section.
Select a 95% confidence interval by pressing enter when you are presented with
the confidence interval menu. Enter the number of controls as 607 with 123
suffering an event and enter the number treated as 607 with 94 suffering an
event.
For this example:
Proportion of controls suffering an event = 0.202636
Proportion of treated suffering an event = 0.15486
With 95% CI's:
Relative risk = 0.764228 (0.598901 to 0.974216)
Relative risk reduction = 0.235772 (0.025784 to 0.401099)
Absolute risk reduction = 0.047776 (0.005225 to 0.081277)
Number needed to treat = 21 (12 to 191)
Here we can say, with 95% confidence, that you need to treat as many as 191
or as few as 12 patients in severe congestive heart failure with this ACEi in
order to prevent one death that would not have been prevented with the standard
non-ACEi therapy in six months of treatment.
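The point estimates in this output can be checked with a short Python sketch;
the confidence intervals require the Gart and Nam methods (ref 35) and are not
reproduced by this simple illustration:
  # A sketch of the point estimates in the example output.
  nc, ec = 607, 123                             # controls and their events
  nt, et = 607, 94                              # treated and their events

  pc, pt = ec / nc, et / nt                     # approx. 0.2026 and 0.1549
  rr = pt / pc                                  # relative risk, approx. 0.7642
  rrr = (pc - pt) / pc                          # relative risk reduction
  arr = pc - pt                                 # absolute risk reduction
  print(rr, rrr, arr, 1 / arr)                  # NNT approx. 21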
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
|False result probabilities|
When considering a diagnostic test for screening populations it is important
to consider the number of false negative and false positive results you will
have to deal with. The quality of a diagnostic test is often expressed in
terms of sensitivity and specificity. Sensitivity is the ability of that test
to pick up what you are looking for and specificity is the ability of the test
to reject what you are not looking for.
DISEASE
Present Absent
TEST + a (true +ve) b (false +ve)
- c (false -ve) d (true -ve)
Sensitivity = a/(a+c)
Specificity = d/(b+d)
We can apply Bayes' theorem if we know the approximate likelihood that a subject
has the disease before they come for screening; this is given by the prevalence
of the disease. For low prevalence diseases the false negative rate will be
low and the false positive rate will be high. For high prevalence diseases the
false negative rate will be high and the false positive rate will be lower.
People are often surprised by the high numbers of projected false positives; you
need a highly specific test to keep this number low. The false positive rate
of a screening test can be reduced by repeating the test. In some cases a test
is performed three times and the patient is declared positive if at least two
out of the three component tests were positive. This Arcus function simply
gives you the probability of false positive and false negative results for a
given prevalence of the disease being tested for (ref 8).
EXAMPLE (from Fleiss ref 8 p 9):
In a hypothetical example 2000 patients were tested with a screening test for
a disease. Of these 2000 patients 1000 were known to have the disease and 1000
were known to be free of the disease:
DISEASE
Present Absent
TEST + 950 (true +ve) 10 (false +ve)
- 50 (false -ve) 990 (true -ve)
To analyse these data in Arcus you must select false result probabilities from
the miscellaneous sub-menu of the instant functions menu in the analysis
section. Enter the true +ve rate as 0.95 (950/(950+50)) and the false +ve rate
as 0.01 (10/(990+10)). Enter the prevalence as 1 in 100 by entering n as 100.
For this example:
For prevalence of 100 per ten thousand of population tested:
Test SENSITIVITY = 95%
Probability of a FALSE POSITIVE result = 0.510309
Test SPECIFICITY = 99%
Probability of a FALSE NEGATIVE result = 0.00051
Here we see that more than half of the patients who test positive will not in
fact have the disease. This is clearly not acceptable for a
full screening method but could be used as pre-screening before further tests
if there was no better initial test available.
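The arithmetic behind this output is a direct application of Bayes' theorem and
can be checked with a short Python sketch (not the Arcus code itself):
  # A sketch of the Bayes calculation behind the example output.
  sens, spec = 0.95, 0.99                       # true +ve and true -ve rates
  prev = 1 / 100                                # prevalence, 1 in 100

  # probability of no disease given a positive test
  false_pos = ((1 - spec) * (1 - prev)) / (sens * prev + (1 - spec) * (1 - prev))
  # probability of disease given a negative test
  false_neg = ((1 - sens) * prev) / ((1 - sens) * prev + spec * (1 - prev))
  print(false_pos)                              # approx. 0.510309
  print(false_neg)                              # approx. 0.00051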
¬<reference list>╪310584 ¬
|Standardized Mortality Ratios|
This selection uses the indirect method to calculate standardized mortality
ratios. You must supply the mortality rates from a reference population, often
census data, and the size of each group of your study population. For each
(age) group you enter the size of that group in your study population and the
age/group specific mortality from the general population. You are then asked
about the units in which your mortality data were entered; for example, if you
entered deaths per 10,000 you should enter 10,000 and if you entered decimal
fractions you should enter 1. The SMR is expressed both as a ratio and as an
integer (100 times the ratio) along with its approximate confidence limits. A
test based on the null hypothesis that the numbers of observed and expected
deaths are equal is also given. This test uses a Poisson distribution
(ref 2, 4, 11).
EXAMPLE (from Bland ref 2 p 301):
The following data represent the age-specific mortality rates for liver
cirrhosis in men and the number of male doctors in each age stratum:
Age group Mortality per million men per year Number of male doctors
15-24 5.859 1080
25-34 13.050 12860
35-44 46.937 11510
45-54 161.503 10330
55-64 271.358 7790
To analyse these data in Arcus you must select standardized mortality ratios
from the miscellaneous sub-menu of the instant functions menu in the analysis
section. Enter the number of groups as 5 then enter mortality and group size
for each age group. Note that group size refers to the study group of doctors
and not the male population as a whole who were used to derive the mortality
data. Enter the mortality denominator as 1000000. Then after the expectation
table enter the observed deaths as 14. Select a 95% confidence interval by
pressing enter when you are presented with the confidence interval menu.
For this example:
Group(age)-specific Observed Population Expected Deaths
mortality
0.000005859 1080 0.006328
0.00001305 12860 0.167823
0.000046937 11510 0.540245
0.000161503 10330 1.668326
0.000271358 7790 2.113879
Total = 4.496601
Standardized Mortality Ratio = 3.113463
(sometimes quoted as 100 x integer = 311)
95% confidence interval = 1.482561 to 4.744365 (148 to 474)
Probability of observing 14 or more deaths by chance P = 0.0002 ***
Probability of observing 14 or fewer deaths by chance P = 0.9999
Here we can see that the total expected deaths from liver cirrhosis in male
doctors is 4.5 per year. The observed number, 14, was statistically highly
significantly greater than expected. With 95% confidence we can state that
male doctors in this country exhibit between 1.5 and 4.7 times the number of
deaths from liver cirrhosis expected from the general male population of
a similar age distribution. If the reason for this SMR is not obvious to you
then please attend a "ward night out" - hic!
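If you wish to check these figures outside Arcus then the following Python
sketch (assuming the SciPy library) reproduces the example. Note that the
confidence limit formula shown, SMR*(1 +/- 1.96/sqrt(observed)), is an
assumption about the approximation used; it happens to match the output above:
  # A sketch reproducing the SMR example; not the Arcus code itself.
  from math import sqrt
  from scipy.stats import poisson

  rates = [5.859e-6, 13.050e-6, 46.937e-6, 161.503e-6, 271.358e-6]
  sizes = [1080, 12860, 11510, 10330, 7790]
  observed = 14

  expected = sum(r * n for r, n in zip(rates, sizes))  # approx. 4.4966
  smr = observed / expected                            # approx. 3.1135
  half = 1.96 * smr / sqrt(observed)
  print(smr, smr - half, smr + half)                   # approx. 1.48 to 4.74
  print(poisson.sf(observed - 1, expected))            # P(14 or more), approx. 0.0002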
¬<p values>╪29175 ¬
¬<confidence intervals>╪31897 ¬
¬<reference list>╪310584 ¬
The Arcus |Algebraic Calculator|
This function is available throughout Arcus. It is called up by pressing the
key combination [Alt]+[C]. You can use it to evaluate complex expressions or
to perform simple arithmetic. A seventy character algebraic expression
evaluator is provided. All calculations are done in double precision. If you
wish to evaluate an expression which consists of more than seventy characters
then you can use the Arcus worksheet; the result, however, will be in single
precision only.
The functions available are listed in the help screens which are invoked by the
usual F1 key press. These are the functions which are available in the Arcus
Worksheet, plus LR, which represents the last result provided by this calculator.
You can use LR in an expression even when the last result was not calculated
in the present calculator session.
Supported functions are:
Constants: PI
EE as e
ABS absolute value
CLOG common (base 10) logarithm
CEXP anti log (base 10)
EXP anti log (base e)
LOG natural (base e, Naperian) logarithm
SQR square root
! factorial (max 170)
LN! log factorial
IZ normal deviate for a p value
UZ upper tail p for a normal deviate
LZ lower tail p for a normal deviate
^ exponentiation (to the power of)
+ addition
- subtraction
* multiplication
/ division
\ integer division
ARCCOS arc cosine
ARCCOSH arc hyperbolic cosine
ARCCOT arc cotangent
ARCCOTH arc hyperbolic cotangent
ARCCSC arc cosecant
ARCCSCH arc hyperbolic cosecant
ARCTANH arc hyperbolic tangent
ARCSEC arc secant
ARCSECH arc hyperbolic secant
ARCSIN arc sine
ARCSINH arc hyperbolic sine
ATN arc tangent
COS cosine
COT cotangent
COTH hyperbolic cotangent
CSC cosecant
CSCH hyperbolic cosecant
SINH hyperbolic sine
SECH hyperbolic secant
SEC secant
TAN tangent
TANH hyperbolic tangent
AND logical AND
NOT logical NOT
OR logical OR
< less than
= equal to
> greater than
Please note that the largest factorial allowed is 170! but you can work with log
factorials via the LN! function, e.g. LN!(171).
Calculations give an order of priority to arithmetic operators; this must be
considered when entering expressions. For example, the result of the expression
"6 - 3/2" is 4.5 and not 1.5 because division takes priority over subtraction.
The following list gives the priority of arithmetic operators in descending
order:
1. Exponentiation (^)
2. Negation (-X)
(Exception = x^-y; i.e. 4^-2 is 0.0625 and not -16)
3. Multiplication and Division (*, /)
4. Integer Division (\)
5. Addition and Subtraction (+, -)
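For comparison, these rules can be illustrated outside Arcus in Python, which
applies the same ordering but writes exponentiation as ** rather than ^:
  print(6 - 3 / 2)    # 4.5, division before subtraction
  print(4 ** -2)      # 0.0625, the x^-y exception
  print(-4 ** 2)      # -16, exponentiation before negation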
As you work through a session with the Arcus calculator you can save individual
expressions and their results to a notepad by pressing S or F2. The notepad is
activated when you finish the present calculator session; at this point it will
present you with a list of all the results and expressions which you have saved
using the S or F2 key during the preceding session. The notepad can be edited
and the results sent to a printer or to the current log file.
An expression and result stack is available in this calculator. You save
results and their expressions to the stack when you press S or F2, i.e. the
same process as saving results to the notepad. You can access information
from the stack for subsequent calculations using the up and down cursor keys.
These cursor keys enable you to search up and down the stack for old results
or expressions to edit.
|APPENDICES|
¬<Glossary>╪292963 ¬
¬<Error Codes>╪293909 ¬
¬<ASCII codes>╪294887 ¬
Appendix One (|Glossary|)
df = degrees of freedom
^ = to the power of
^Key = Ctrl + another Key
/ = divided by
* = multiplied by
Z = standardized normal deviate
r = Pearson's product moment correlation coefficient
p = probability, see ¬<p values>╪29175 ¬
α = significance level
x = individual value of a vector/group/sample
n = vector/group/sample size
µ = mean (e.g. arithmetic mean, µ = Σx/n)
VAR = variance (e.g. sample variance, s² = (Σx²-(Σx)²/n)/(n-1))
SD = standard deviation (e.g. of mean, s = SQR(VAR))
SE = standard error (e.g. of mean, SEM = SD/SQR(n))
MS = mean square
CI = confidence interval, see ¬<confidence intervals>╪31897 ¬
ln(x) = natural (Naperian, base e) logarithm of x
vs = versus
DOS = disk operating system
ROM = read only memory
PC = personal computer
Program = programme
Disk = disc
Appendix Two (|Error Codes|)
The error trap within Arcus Pro-Stat provides messages which explain most of
the common error states but error numbers alone are sometimes given:
5 Illegal function requested
6 Overflow/Underflow (numbers greater than 3.4E+38 or smaller than 1.7E-38 in magnitude)
7 Out of memory
9 Array or memory error
11 Division by zero
14 Out of memory for some text and internal program data
16 Formula too complex
24 Waited too long for printer (beep)
25 Printer fault
27 Out of paper
51 Internal computer error
53 Requested disk file not found
54 Bad file mode
55 Attempt to open an already open file (Internal)
57 Disk drive fault
61 Disk full
64 Bad file name
67 Too many files on disk/directory
68 Requested disk does not exist
70 Disk/File access denied
71 Disk drive not closed
72 Disk fault
76 Path not found
Appendix Three (|ASCII codes|)
These are the decimal codes which can be used in the Arcus database CHR function
and which are returned by the Arcus database ASC function. Please remember that
all of these characters are accessible through an extended keyboard by holding
down the Alt key and tapping out the relevant code on the right hand numeric key
pad. The table below lists the characters for codes 33 to 254. Values below
this do have character representations but they double as control characters,
e.g. 9 is a tab. It is best to avoid these control characters if you can. The
extended character set is represented by values above 126. Please note that
extended characters may appear different on different computers, most notably
those running foreign language settings of DOS.
30 40 50 60 70 80 90 100 110 120 130 140 150 160
0 ( 2 < F P Z d n x é î û á
1 ) 3 = G Q [ e o y â ì ù í
2 * 4 > H R \ f p z ä Ä ÿ ó
3 ! + 5 ? I S ] g q { à Å Ö ú
4 " , 6 @ J T ^ h r å É Ü ñ
5 # - 7 A K U _ i s } ç æ ¢ Ñ
6 $ . 8 B L V ` j t ~ ê Æ £ ª
7 % / 9 C M W a k u ë ô ¥ º
8 & 0 : D N X b l v Ç è ö ₧ ¿
9 ' 1 ; E O Y c m w ü ï ò ƒ ⌐
170 180 190 200 210 220 230 240 250
0 ┤ ╛ ╚ ╥ ▄ µ ≡ ·
1 ½ ╡ ┐ ╔ ╙ ▌ τ ± √
2 ¼ ╢ └ ╩ ╘ ▐ Φ ≥ ⁿ
3 ¡ ╖ ┴ ╦ ╒ ▀ Θ ≤ ²
4 « ╕ ┬ ╠ ╓ α Ω ⌠ ■
5 » ╣ ├ ═ ╫ ß δ ⌡
6 ░ ║ ─ ╬ ╪ Γ ∞ ÷
7 ▒ ╗ ┼ ╧ ┘ π φ ≈
8 ▓ ╝ ╞ ╨ ┌ Σ ε °
9 │ ╜ ╟ ╤ █ σ ∩ ∙
Code 170 and 124 characters are not shown above because they are special
characters used by this hypertext system. 170 is the angle bar on most
keyboards and 124 is the vertical dashed line on most keyboards. In this
hypertext system 124 is used either side of a section title and 170 is used
either side of a link item. These characters can not be used in the body of
the hypertext.
|HELP|
This hypertext system provides an electronic user guide for Arcus Pro-Stat.
You navigate its pages using the following key strokes:
[Up] Move up one line
[Down] Move down one line
[Page Up] Move up one page
[Page Dn] Move down one page
[Tab] Move to the next link item
[Shift]+[Tab] Move to the previous link item
[Enter] Select the highlighted link item
[Home] Move to top of current section
[End] Move to bottom of current section
[I] Search the title index
[S] Search the entire help text for a word or phrase
[B] Move back a page
[P], [E] Edit and/or send current section to log file or printer
[Q], [Esc] Quit Arcus Hypertext
The left mouse button selects the link item or the bottom menu bar item which
is at the mouse cursor location when you press it. The right button quits this
hypertext help system.
Please note that all of the information in Arcus hypertext help is contained
in printed form in the Arcus reference manual.
For more information please see ¬<Hypertext>╪298521 ¬.
|Hypertext|
Arcus Pro-Stat has its own hypertext engine. This provides on-line help within
all Arcus software and gives you the opportunity to customise Arcus to your own
needs.
All of the help text is contained in a file called HELP.HTT. This is arranged
into chapters which are referred to as sections. Each section has a title and
all of the section titles are listed in the index. A section may contain links
to other related sections. Each link is called a link item. Link items are
shown as highlighted text and are often enclosed in angle brackets, e.g. <Link Item>.
In order to move to the section denoted by a link item you must first make sure
that the link item is active. On color monitors, active link items are
displayed in bright green and inactive link items are dull cyan. To make a link
item active just move through the different link items by pressing the tab key.
When you have made your chosen link item active you can select it by pressing
the enter key. Alternatively, click on any link item with the left hand mouse
button. If you want to move back to the page you were reading
before you selected the link item then press [B]. The number of back pages
available is displayed by the [B] button at the bottom left of the screen. If
you can not find what you are looking for in the index then you can search the
entire help text by pressing [S]. This searches for any word or phrase that
you specify.
The following keys are active in Arcus Hypertext:
[Up] Move up one line
[Down] Move down one line
[Page Up] Move up one page
[Page Dn] Move down one page
[Tab] Move to the next link item
[Shift]+[Tab] Move to the previous link item
[Enter] Select the highlighted link item
[Home] Move to top of current section
[End] Move to bottom of current section
[I] Search the title index
[S] Search the entire help text for a word or phrase
[B] Move back a page
[P], [E] Edit and/or send current section to log file or printer
[Q], [Esc] Quit Arcus Hypertext
The left mouse button selects the link item or the bottom menu bar item which
is at the mouse cursor location when you press it. The right button quits this
hypertext help system.
¬<Hypertext Help System Maintenance>╪300898 ¬
|Hypertext Help System Maintenance|
You can modify and/or expand Arcus Hypertext. The HELP.HTT file, which contains
all of the hypertext, is a plain ASCII text file. It can be changed using any
text processor. This is, however, a very large file which demands a capable
text processor; EDIT in DOS often cannot cope with it. The easiest way to
maintain HELP.HTT is to select "hypertext help system maintenance" from the
information menu. This enables you to work through Arcus hypertext, edit
specified sections and create new ones. Your old hypertext file is saved as
HELP.BAK.
If you are planning to do a lot of hypertext maintenance in Arcus then please
aim to use a fast computer with an efficient hard disk drive. The re-indexing
procedure is time consuming on a 286 with an un-cached hard disk. A well
configured 486 with a reasonably efficient hard drive will rapidly re-index
Arcus Hypertext. Disk cache software such as SMARTDRV in MS-DOS 6 gives a
large improvement in hard disk operation.
There are only two special characters which you must remember when editing
Arcus hypertext: these are the vertical dashed line and the angle bar. The
vertical dashed line is usually at the bottom left of your keyboard to the
left of Z and is usually the shifted version of the back slash \. The vertical
dashed line has the ASCII code 124. The angle bar is near the top left hand
corner of most keyboards and is usually the shifted version of the single
opening quote `. The angle bar has the ASCII code 170. Neither of these
characters can be displayed here so let the vertical dashed line = {124} and
let the angle bar = {170}. You should also avoid the use of ASCII character
216 (╪).
To mark text as a title you must include two {124} on that line. There must
be no other text on the title line. To mark text as a link item you must
enclose it in two {170}'s. Only the first twenty characters of a title or a
link item are used for indexing and linking. Try to use link items which match
section titles exactly; this enables Arcus to do all indexing for you
automatically.
Sample of hypertext:
{124}Section 1{124}
This is an example of body text in Arcus Hypertext.
For more information please see {170}body text{170}.
{124}Body Text{124}
This is the section on body text which links to the link item in section 1.
Thus, the only restrictions on hypertext are the use of ASCII characters 124,
170, 216 and control characters such as tabs (ASCII 9). You can use any other
ASCII characters; for example, you can compose diagrams using the line drawing
characters apart from 216 (see ¬ASCII codes╪294887 ¬).
There are no practical limits on the size of the Arcus hypertext file. If you
have a vast number of sections and a large worksheet open then you might run
into memory problems on a computer with little free memory. Otherwise you
should be able to run your own customised versions of Arcus Hypertext without
any problems.
If you teach statistical methods then please see ¬educational uses╪304017 ¬.
|Educational Uses| of Arcus Pro-Stat
Arcus Pro-Stat has been written for use by people of all levels of statistical
expertise. Some Arcus users have written their own versions of the ¬hypertext╪298521 ¬
help system to give additional explanations and exercises to their students.
Arcus is also used by many experienced statisticians. There is therefore the
potential for someone to learn statistical methods with Arcus and then go on
to practise those methods with the same package. This avoids a second learning
curve.
|Finish|
This closes the current Arcus session. If you have forgotten to save any new
or altered worksheet data then you will be prompted to do so before leaving
Arcus.
|Information|
This section provides pages of text on using Arcus in your approach to good
statistical design, analysis and presentation. There is also an interactive
statistical method selection session which covers the more simple analyses.
|Function Overview|
Here is a brief summary of the functions within the analysis section of Arcus:
¬DESCRIPTIVE STATISTICS╪80612 ¬
~~~~~~~~~~~~~~~~~~~~~~
Number, arithmetic mean, variance, standard deviation, standard error of the
mean, user defined confidence interval for the mean, geometric mean, skewness,
kurtosis, maximum, upper quartile, median, lower quartile, minimum, user
defined quantile.
¬ARITHMETICAL MANIPULATION╪78201 ¬
~~~~~~~~~~~~~~~~~~~~~~~~~
Manipulate one or several worksheet columns using your own formulae.
Transformations for proportions.
¬PICTORIAL STATISTICS╪81471 ¬
~~~~~~~~~~~~~~~~~~~~
Histogram, box and whisker, scatter, normal, survival, error bar, spread and
ladder.
¬PARAMETRIC╪87475 ¬
~~~~~~~~~~
Single sample Student t, paired Student t, unpaired Student t, F (variance
ratio), Z (normal distribution) and Shapiro-Wilk W test for non-normality.
¬NONPARAMETRIC╪98877 ¬
~~~~~~~~~~~~~
Mann-Whitney U, Wilcoxon signed ranks, Spearman's rank correlation, Kendall's
rank correlation, Cuzick's test for trend, confidence intervals for quantiles,
Kolmogorov Smirnov two sample test, Ranking and normal scores.
¬REGRESSION AND CORRELATION╪119789 ¬
~~~~~~~~~~~~~~~~~~~~~~~~~~
Simple linear, general/multiple linear, regression in groups (linearity,
differences between regression lines and covariances), polynomial (with area
under curve and back interpolation), linearized estimates (exponential,
geometric and hyperbolic) and probit analysis (also for logistic curves).
¬ANALYSIS OF VARIANCE╪158578 ¬
~~~~~~~~~~~~~~~~~~~~
One way, two way, two way with replicates/repeated measures, crossover,
Kruskal Wallis and Friedman.
¬SURVIVAL ANALYSIS╪182274 ¬
~~~~~~~~~~~~~~~~~
Kaplan-Meier product limit estimates of survival and the cumulative hazard
function (including plots), simple Berkson-Gage life tables, log-rank and
Wilcoxon tests and Wei Lachin.
¬DISTRIBUTIONS╪213522 ¬
~~~~~~~~~~~~~
Normal, chi-square, Student t, Snedecor's F, Studentized Q, binomial, Poisson,
Spearman's rho and Kendall's tau.
¬CHI-SQUARE╪218665 ¬
~~~~~~~~~~
Two by two, two by k with trend, r by c with trend, McNemar's, Mantel Haenszel
and Woolf.
¬EXACT╪243294 ¬
~~~~~
Fisher's, exact (Gart) confidence intervals for two by two odds, Liddell's and
the sign test.
¬RANDOMISATION╪252007 ¬
~~~~~~~~~~~~~
Integer series, case-control pairs and case / control groups.
¬SAMPLE SIZE╪256010 ¬
~~~~~~~~~~~
For Student t tests, comparison of proportions and population surveys.
¬PROPORTIONS╪262904 ¬
~~~~~~~~~~~
Single, unpaired and paired.
¬MISCELLANEOUS╪269298 ¬
~~~~~~~~~~~~~
Bayesian (test likelihoods, false result probabilities), relative risk,
risk reductions with number needed to treat and standardized mortality ratios.
¬ALGEBRAIC CALCULATOR╪288806 ¬
~~~~~~~~~~~~~~~~~~~~
Full function algebraic expression evaluator available by pressing Alt+C from
any menu or result screen.
|Benefits of Registration|
Registered users of Arcus are kept informed of developments in the Arcus project
by newsletters. Upgrades are offered to registered users at low cost and all
registered users can request new functions for Arcus.
Part of each Arcus registration fee is donated to a registered charity and the
rest is fed back into further research and development of Arcus. This project is to
be supported indefinitely.
If you are not a registered Arcus user then you can order your copy of the
latest version of Arcus with a clip bound manual by pressing the enter key to
select the order form. When the order form is displayed, press E and fill in
your details. You can then print out the completed order form.
¬<Order Form>╪308826 ¬
|Order Form| & INVOICE FOR ARCUS PRO-STAT STATISTICAL ANALYSIS SYSTEM
Supplier: Medical Computing, Tel UK (0)695 424 034
83, Turnpike Road, FAX UK (0)51 256 7001
Aughton,
West Lancs,
L39 3LD.
United Kingdom
Supply to:
Post code:
What is your intended use for Arcus?
If this is a site licence who is the contact for Arcus newsletters?
I require (tick one) [ ] 3.5 inch 1.4MB high density diskette
[ ] 3.5 inch 720k diskettes
[ ] 5.25 inch 360k floppy disks
I understand that Arcus Pro-Stat version 3.0 or later requires at least a
286 processor to run [ ].
Licence fees: Quantity required: Total Price:
Single user £ 139 [ ] [ ]
Ten user £ 389 [ ] [ ]
Twenty user £ 590 [ ] [ ]
Fifty user £1200 [ ] [ ]
Large site £negotiable [ ] [ ]
Postage & Packing: £ 8 for UK [ ]
£15 for Non-UK
TOTAL [ ]
Please make all payments in pounds sterling.
Please make cheques payable to Dr Iain E. Buchan.
Official Government and University orders are accepted.
Convertible cheques in pounds sterling or US money orders are accepted.
If you have any questions then please telephone or FAX to the UK numbers
listed above.
|Reference List|
¬<Introductory Texts>╪310834 ¬───────────∙ref 1 - 3
¬<Core Reference Texts>╪311139 ¬─────────∙ref 4 - 7
¬<Other references>╪311556 ¬─────────────∙ref 8 - 31
¬<Algorithms>╪315734 ¬───────────────────∙ref A1 - A21
|Introductory Texts|
1. Petrie Aviva, Lecture Notes on Medical Statistics, Blackwell Scientific
Publications 1990.
2. Bland Martin, An Introduction to Medical Statistics, Oxford Medical
Publications 1989.
3. Colton Theodore, Statistics in Medicine, Little, Brown & Co. 1974.
|Core Reference Texts|
4. P. Armitage & G. Berry, Statistical Methods in Medical Research,
Blackwell 1987.
5. Altman Douglas G., Practical Statistics for Medical Research, Chapman
and Hall 1991.
6. Conover W. J., Practical Nonparametric Statistics, Wiley 1980.
7. Kendall M. G., Stuart A. and Ord J. K., The Advanced Theory of
Statistics, (4th edition), London: Griffin 1983.
|Other References|
8. Fleiss J., Statistical Methods for Rates and Proportions, Wiley 1981.
9. Fleiss J., J. Chron. Diseases, 32, pp. 69 - 77, 1979.
10. Schlesselman J., Case-Control Studies, Oxford University Press 1982.
11. Gardner Martin J., Altman Douglas G., Statistics with Confidence -
Confidence Intervals and Statistical Guidelines, British Medical Journal
1989.
12. Sackett David L. et al., Clinical Epidemiology - a basic science for
clinical medicine, Little, Brown & Co. 1985.
13. Wallenstein Sylvian, Some statistical methods useful in circulation
research, Circulation Research 47(1) 1980.
14. Wetherill G. Barrie, Intermediate Statistical Methods, Chapman Hall 1981.
15. Hollander Myles, Wolfe Douglas A., Nonparametric Statistical Methods,
Wiley 1973.
16. Basic Professional Development System (Compiler 7.1), Microsoft
Corporation 1990.
17. FORTRAN Optimising Compiler (version 5.1), Microsoft Corporation 1989.
18. Finney D. J., Probit Analysis, Cambridge University Press 1971.
19. Finney D. J., Statistical Method in Biological Assay, Charles Griffin &
Co. 1978.
20. Liddell F. D. K., Simplified exact analysis of case-referent studies;
matched pairs; dichotomous exposure., J. Epidemiol. Comm. Health, 37,
82-84, 1983.
21. Shapiro S. S. & Wilk M. B., An analysis of variance test for normality.,
Biometrika, 52(3), 591 ff., 1965.
22. Miller R. G. (jnr), Simultaneous Statistical Inference, (2nd edition)
Springer-Verlag 1981.
23. Draper N. R. and Smith H., Applied Regression Analysis, (2nd edition)
New York: Wiley 1981.
24. Lawless J. F., Statistical Models and Methods for Lifetime Data, New York:
Wiley 1982.
25. Kalbfleisch J. D. and Prentice R. L., Statistical Analysis of Failure
Time Data, New York: Wiley 1980.
26. Wei L. J. and Lachin J. M., Two Sample Asymptotically Distribution Free
Tests for Incomplete Multivariate Observations, J. Am. Statist. Ass.
79, 653-661, 1984.
27. Bailey N. T. J., Mathematics, Statistics and Systems for Health, New York:
Wiley 1977.
28. Cuzick Jack, A Wilcoxon-Type Test for Trend, Stat. Med. 4, 87-89, 1985.
29. Bland Martin & Altman Douglas, Statistical Methods for Assessing the
Difference Between Two Methods of Measurement, Lancet, 307-310, 1986.
30. Dupont W. D., Power and Sample size calculations, Controlled Clinical
Trials 11, 116-128, 1990.
31. Pearson & Hartley, Biometrika tables for statisticians, 3rd Ed.,
Cambridge University Press, 1970.
32. Belsley, Kuh, Welsch, Regression Diagnostics, Wiley 1980.
33. Press W. H. et al., Numerical Recipes, The Art of Scientific Computing,
2nd Ed., Cambridge University Press, 1992.
34. Ross J. G., NonLinear Estimation, Springer-Verlag New York 1990.
35. Gart J. J. & Nam J., Approximate interval estimation of the ratio of
binomial parameters: a review and corrections for skewness, Biometrics 44,
323-338, 1988.
36. Sackett David L. et al., Interpretation of diagnostic data (5), Canadian
Medical Association Journal, 129, 947-975, 1983.
37. Laupacis A., Sackett D. L., Roberts R. S., An assessment of clinically
useful measures of the consequences of treatment, New England J. Med.,
318(26), 1728-33, 1988.
38. Haynes Brian & Sackett David, Personal communications on diagnostic and
treatment outcome statistics, McMaster University, 1993.
39. Peto R., Pike M. C., Armitage P., Breslow N. E., Cox D. R., Howard S. V.,
Mantel N., McPherson K., Peto J., Smith P. G., Design and analysis of
randomised clinical trials requiring prolonged observation of each patient.
Part I: Introduction and design, Br. J. Cancer, 34, 585-612, 1976.
40. Peto R., Pike M. C., Armitage P., Breslow N. E., Cox D. R., Howard S. V.,
Mantel N., McPherson K., Peto J., Smith P. G., Design and analysis of
randomised clinical trials requiring prolonged observation of each patient.
Part II: Analysis and Examples, Br. J. Cancer, 35, 1-39, 1977.
Published |Algorithms|
A1 Pike M. C., Hill I. D., Algorithm 291, Logarithm of the Gamma Function,
Comm. Ass. Comput. Mach., 9, 684 1966.
A2 Macleod Allan J., AS 245, A Robust and Reliable Algorithm for the
Logarithm of the Gamma Function, Appl. Statist. 38(2) 1989.
A3 Hill I. D., AS 66, The Normal Integral, Appl. Statist. 22(3) 1973.
A4 Odeh R. E., Evans J. O., AS 70, Percentage Points of the Normal
Distribution, Appl. Statist. 23 1974.
A5 Best D. J., Roberts D. E., AS 91, The Percentage Points of the Chi²
Distribution, Appl. Statist. 24(3) 1975.
A6 Dinneen L. C., Blakesley B. C., AS 62, A Generator for the Sampling
Distribution of the Mann-Whitney U Statistic, Appl. Statist. 22(2) 1973.
A7 Majumder K. L., Bhattcharjee G. P., AS 63, The Incomplete Beta Integral,
Appl. Statist. 22(3) 1973.
A8 Majumder K. L., Bhattcharjee G. P., AS 64, Inverse of the Incomplete Beta
Function Ratio, Appl. Statist. 22(3) 1973.
A9 Cran G. W., Martin K. J., Thomas G. E., R19 and AS 109 further to AS 63
and AS 64, Appl. Statist. 26(1) 1977.
A10 Berry K. J., Mielke P. W., Cran G. W., R83 further to AS 64, Appl.
Statist. 39(2) 1990.
A11 Lund R. E., Lund J. R., AS 190, Probabilities and Upper Quantiles for the
Studentized Range, Appl. Statist. 34 1983.
A12 Royston J. P., R69 further to AS 190, Appl. Statist. 1987
A13 Best D. J., Roberts D. E., AS 89, Upper Tail Probabilities of Spearman's
Rho, Appl. Statist. 24(3) 1975.
A14 Best D. J., Gipps P. G., AS 71, Upper Tail Probabilities of Kendall's Tau,
Appl. Statist. 23(1) 1974.
A15 Thomas Donald G., AS 36, Exact Confidence Limits for the Odds Ratio in a
Two by Two Table, Appl. Statist. 20(1) 1971.
A16 Shea B. L., AS 239, Chi-square and incomplete gamma integral, Appl.
Statist. 37(3) 1988.
A17 Royston J. P., AS 181, The W Test for Normality, Appl. Statist. 31(2)
1982.
A18 Royston J. P., AS 177.3, Expected Normal Order Statistics (Approximate),
Appl. Statist. 31(2), 1982.
A19 Harding E. F., An Efficient Minimal Storage Procedure for Calculating the
Mann-Whitney U, Generalised U and Similar Distributions, Appl. Statist.
33 1983.
A20 Neumann N., Some Procedures for Calculating the Distributions of
Elementary Nonparametric Test Statistics, Statistical Software
Newsletter, 14(3) 1988.
A21 Makuch Robert et al., AS 262, A Two Sample Test for Incomplete
Multivariate Data, Appl. Statist. 40(1), 1991.